Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chicks are. But what do they need it for? World domination?

Free Cross-Lingual Acoustic Feature Database For Tabular ML And Emotion Recognition

So I have a free-to-use, 7-language macro-prosody sample pack for the community to play with. I’d love feedback. There’s no raw audio; it’s voice telemetry for 7 languages, normalized and graded. Good for building emotive TTS, benchmarking less common languages, cross-linguistic comparison, etc.

90+ languages available for possible licensing.

https://huggingface.co/datasets/vadette/macro_prosody_sample_set

This pack was selected to span typologically distinct language families and speech types:

Korean is a language isolate with phrase-final focus marking and complex mora timing — a useful contrast to the stress-timed Indo-Aryan languages.

Hindi is the largest corpus here and provides strong statistical power for Indo-Aryan prosody baselines.

Hebrew is a VSO Semitic language with root-and-pattern morphology; the high metadata coverage makes it useful for demographic-stratified analyses.

Manx is a Celtic revival language with a tiny native speaker community. The 98% PRISTINE rate reflects the controlled recording conditions of motivated community contributors.

Tzeltal is a Mayan language with ergative-absolutive alignment and a distinctive tonal register system. It is rarely represented in acoustic datasets.

Maguindanao (SPS2) is spontaneous speech from a Philippine Austronesian language. The T2-heavy distribution reflects the naturalistic recording conditions of the SPS2 corpus.

Lasi (SPS2) is a Sindhi variety spoken in Balochistan. Shorter median clip duration (3.4s vs 5–6s for CV24 languages) reflects the spontaneous speech format.
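
If you want to poke at the pack, here’s a minimal loading sketch using the Hugging Face datasets library. The split and the column names (language, duration_s) are placeholders; check the dataset card for the real schema.

    # Minimal exploration sketch; split and column names are assumptions.
    from collections import defaultdict
    import statistics

    from datasets import load_dataset

    ds = load_dataset("vadette/macro_prosody_sample_set", split="train")
    print(ds.column_names)

    # e.g. median clip duration per language, if those columns exist
    durations = defaultdict(list)
    for row in ds:
        durations[row["language"]].append(row["duration_s"])
    for lang, vals in sorted(durations.items()):
        print(lang, round(statistics.median(vals), 2))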

submitted by /u/Wooden_Leek_7258

Building A Multi-turn, Time-aware Personal Diary AI Dataset For RLVR Training — Looking For Ideas On Scenario Design And Rubric Construction [serious]

Hey everyone,

I’m working on designing a training dataset aimed at fixing one of the quieter but genuinely frustrating failure modes in current LLMs: the fact that models have essentially no sense of time passing between conversations.

Specifically, I’m building a multi-turn, time-aware personal diary RLVR dataset — the idea being that someone uses an AI as a personal journal companion over multiple days, and the model is supposed to track the evolution of their life, relationships, and emotional state across entries without being explicitly reminded of everything that came before.

Current models are surprisingly bad at this in ways that feel obvious once you notice them. Thought this community might have strong opinions on both the scenario design side and the rubric side, so wanted to crowdsource some thinking.
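
To make the design question concrete, here’s one hypothetical shape a single training example could take; every field name below is a placeholder, not a spec. The probe/rubric split is what makes it RLVR-friendly: the reward comes from checkable criteria rather than a free-form reference answer.

    # Hypothetical example structure; all field names are illustrative.
    example = {
        "diary_entries": [
            {"day": 1, "text": "First day at the new job. Rough. My manager barely spoke to me."},
            {"day": 9, "text": "Things with my manager are slowly improving."},
            {"day": 23, "text": "Got assigned my first real project today."},
        ],
        "probe": {
            "day": 30,
            "user_message": "How long have I been at this job now?",
        },
        # verifiable checks a grader can score mechanically
        "rubric": [
            {"criterion": "states roughly one month has passed"},
            {"criterion": "recalls the manager arc without being re-told"},
            {"criterion": "does not invent events absent from the entries"},
        ],
    }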

submitted by /u/Over_Valuable_12

What If There Was An Extensive Relationship Compatibility Questionnaire (Details In The First Comment) That Is Meant To Work As A Preemptive And Predictive Diagnostic Report For Friction In Relationships?

Hi everyone,

I’ve been studying relationship dynamics and friction points for a research proposal recently. While going through a lot of material and patterns around where couples struggle, I realized something interesting.

Many relationship issues aren’t sudden. They slowly build over time through misunderstandings, mismatched expectations, or different ways of handling stress and conflict.

While looking into this, I started working on something that’s basically ‘a very detailed relationship questionnaire’. Both partners would answer it separately, and the idea is to generate a kind of predictive and preemptive diagnostic report for the relationship.

The goal isn’t to judge the relationship or tell people whether they should stay together or not. It’s more about identifying things like:

• areas where partners naturally align
• possible friction points
• differences in expectations or emotional needs
• places where misunderstandings could happen later

So couples can talk about these things earlier, instead of discovering them years down the road.

I’ll be honest about something too. I’ve never really been blessed with what many of you have here. A stable relationship with someone you care about is a pretty beautiful thing, and in some ways I’m a little jealous of it.

So this is partly curiosity and partly a hope that maybe tools like this could help people keep what they already have strong.

I wanted to ask people who are actually in relationships:

  1. Would you and your partner try something like this?

  2. Would you want to see the results if it pointed out possible future friction points?

  3. Is there something you wish you had understood earlier about your partner?

Just genuinely curious about how couples would feel about something like this.

(Questionnaire would be completely anonymous.)

submitted by /u/Additional_Fee1673

Reliable B2B Data Provider For Lead Generation (Verified Contacts & Decision-Makers)

Hi everyone,

I run a research team that helps lead generation agencies, sales teams, and B2B companies find accurate contact data for outreach and prospecting. If you’re doing cold email, LinkedIn outreach, or sales prospecting, we can help you with:

• Verified B2B contact databases
• Decision-maker contact numbers
• Professional email addresses
• Industry-specific prospect lists
• Targeted company databases (any industry, any region)
• Custom lead lists based on your exact ICP

We focus on quality over bulk, so the goal is to give you usable contacts that actually help you book meetings and generate leads.

This works well for:

• Lead generation agencies
• SDR teams
• Recruitment firms
• SaaS companies
• Marketing agencies
• B2B founders doing outbound

If you need targeted contacts for a specific industry, country, or job title, feel free to comment or send me a DM.

Happy to share more details and see if we can help.

Thanks!

submitted by /u/HelicopterNo8935

Butterflies & Moths Of Austria – Fine-grained Lepidoptera Dataset

I repackaged the Butterflies & Moths of Austria dataset to make it easier to use in ML workflows.

The dataset contains 541,677 images of 185 butterfly and moth species recorded in Austria, making it potentially useful for:

  • biodiversity ML
  • species classification
  • computer vision research

Hugging Face dataset:
https://huggingface.co/datasets/birder-project/butterflies-moths-austria

Original dataset (Figshare):
https://figshare.com/s/e79493adf7d26352f0c7

Credit to the original dataset creators and contributors 🙌
This Hugging Face version mainly reorganizes the data to make it easier to load and work with in ML pipelines (ImageFolder format).
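
For a quick start, loading it should be the standard datasets flow; a sketch, with the split name and label column assumed:

    # Loading sketch; split name and label column follow the usual
    # ImageFolder conventions and are assumptions here.
    from datasets import load_dataset

    ds = load_dataset("birder-project/butterflies-moths-austria", split="train")
    print(ds.features)  # expect an image column plus a class label
    sample = ds[0]
    print(sample["image"].size, sample["label"])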

submitted by /u/hassonofer

What Companies Provide Automated Web Scraping Of News Websites?

I don’t want to build scrapers, so I have two options.

  1. Scraped news APIs & aggregators: these platforms crawl millions of sources daily and serve you clean, structured data. Example: Webz.io, an enterprise-grade provider that scrapes millions of news sites, blogs, and forums daily. They provide highly granular filtering and historical data.
  2. Need to scrape niche, heavily protected sites or extract highly specific data points? Go for custom web scraping & AI extraction infrastructure. Example: Forage AI; they sit right at the intersection of custom web scraping and AI-powered data pipelines, catering heavily to enterprises and AI developers.

As a non-engineer these are the two options I can think of, open for suggestions.

submitted by /u/3iraven22

Starting A Small Project Exploring MIMIC-IV.

As a cardiology resident interested in clinical AI, my goal is to better understand how real ICU data can be used for predictive modeling. Current focus:

  • dataset exploration
  • variable understanding
  • data cleaning

Currently in the dataset exploration and cleaning phase. MIMIC is incredibly rich: thousands of ICU stays and hundreds of clinical variables — but turning raw hospital data into something usable for ML is not trivial.

My goal is simple: learn how clinical data can be transformed into predictive models for patient outcomes. Curious to hear from others who have worked with MIMIC or clinical ML.
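
In case it helps anyone at the same stage, this is the kind of first-pass sketch I mean; the paths follow the MIMIC-IV v2.x layout from the credentialed PhysioNet download, so adjust to your local copy.

    # First-pass look at ICU stays; los (length of stay) is in days.
    import pandas as pd

    icustays = pd.read_csv("mimic-iv/icu/icustays.csv.gz")
    print(icustays[["subject_id", "stay_id", "los"]].describe())

    # e.g. stays longer than 48 hours as a starting cohort
    cohort = icustays[icustays["los"] > 2.0]
    print(len(cohort), "stays longer than 48h")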

submitted by /u/FrequentViolinist672

Customer Funnel Dataset Suggestions.

Hello. I have been looking for datasets for customer funnel analysis (for SQL-based analysis). I want to show my proficiency in SQL data cleaning and analysis via this project, so a dataset with null and duplicate values would be really effective, I believe. Any suggestions or resources?
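
For context, the kind of cleaning pass I’d want to demonstrate looks roughly like the sketch below, run here via DuckDB from Python; the table and column names are made up.

    # Illustrative funnel-cleaning pass; funnel.csv and its columns are hypothetical.
    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE events AS SELECT * FROM read_csv_auto('funnel.csv')")

    # drop NULL user ids and collapse duplicates, keeping the earliest event per step
    cleaned = con.execute("""
        SELECT user_id, step, MIN(event_time) AS first_seen
        FROM events
        WHERE user_id IS NOT NULL
        GROUP BY user_id, step
        ORDER BY user_id, step
    """).fetchdf()
    print(cleaned.head())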

submitted by /u/xudling_pong23

Make Your AI Assistant Behave, Not Just Sound Smart

Most AI assistants fail for a simple reason:
they were never trained for real product behavior.

We built DinoDS to fix that.

DinoDS is a production-grade training suite for teams building AI assistants that need to:
• respond in a consistent tone
• follow strict output formats
• make better decisions about when to answer vs retrieve
• produce reliable structured outputs

Instead of generic data, DinoDS focuses on behavioral training for real AI workflows.

If you’re building serious AI products and want your models to behave reliably in production, let’s talk.

DM me if you want access.

submitted by /u/JayPatel24_

Has Anyone Used ThorData To Skip The Web Scraping Phase? Found Some Solid Structured Data For E-commerce/socials.

Recently I was working on a market research project and frankly, I was getting exhausted spending 80% of my time just maintaining web scrapers. Dealing with rotating residential proxies, CAPTCHAs, and sites constantly changing their DOM structure (looking at you, Amazon and TikTok) is a massive headache when you just want to get to the actual data analysis.

While looking for alternatives to building scrapers from scratch, I stumbled across a platform called Thordata (thordata.com/products/datasets). I spent some time digging into their docs and catalog, and it seems pretty interesting from an engineering/analytics standpoint.

Basically, they handle the extraction and structuring from heavy anti-bot sites and serve it up ready to use. A few things that stood out to me:

  • Coverage: They have a pretty heavy focus on e-commerce (Amazon, Walmart, Shopee) and social media (TikTok, X, Instagram). They also have B2B stuff like LinkedIn and Crunchbase.
  • Delivery formats: This is what caught my eye. You can either get static datasets (good for training models or backtesting), or use their APIs to pull live data if you’re building a dashboard or tracking real-time prices/trends.
  • Cleanliness: The data fields (like product specs, reviews, social metrics) are already parsed into clean JSON/CSV, so it skips the whole regex/parsing step.

For me, the main appeal is just outsourcing the infrastructure pain. Not having to manage headless browsers or pay a premium for proxy networks just to get reliable e-commerce data is a huge time saver.

Has anyone here actually used them in a production environment? I’m curious to know:

  1. How is the API latency if you are using it for live feeds?
  2. How quickly do they update their schemas when these big platforms push major UI/backend updates?

Would love to hear your thoughts, or if you guys have other go-to alternatives for these specific sites (aside from just building it yourself). Cheers.

submitted by /u/Mammoth-Dress-7368

Dataset On Movies For My Exploratory Analysis

Hi guys, I’m thinking of presenting a movies dataset as part of my subject under data visualization, and explaining the exploratory analysis I did on the data.

But the lecturer has said it should read like storytelling, not simply state the obvious points, for example “top 20 movies of all time”, etc.

Can anyone provide insights on how I can steer this dataset toward a good storytelling angle and explore the data further for the audience?

I’m looking at the generic movie datasets on Kaggle.

If anyone has other suggestions, including choosing a different dataset, that would be helpful too; I’d love to hear your thoughts.

I have to present just the material I’m visually plotting, not the complete project, so the professor can check where I am and give feedback to improve.
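
For example, one angle I’m considering: which genres rose and fell by decade? A rough sketch, assuming a Kaggle-style CSV with made-up column names:

    # Storytelling-angle sketch; column names (release_year, genre) are
    # hypothetical, adjust to the actual file.
    import pandas as pd

    movies = pd.read_csv("movies.csv")
    movies["decade"] = (movies["release_year"] // 10) * 10

    # releases per genre per decade -> a narrative about rising/falling genres
    trend = movies.groupby(["decade", "genre"]).size().unstack(fill_value=0)
    print(trend.tail())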

submitted by /u/dishdash-paradox

SAP Data Anonymization For Research Project

Hey y’all, fresher here. I am working on an academic project (enterprise analytics pipelines and BI systems) and exploring whether my company would even remotely consider providing the data, and whether it can be anonymized. Does anyone here have experience anonymizing data? If so, what are the ways to do it?

E.g.:

  • Masking identifiers / generating synthetic datasets from real distributions
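
To make the masking option concrete, here is a minimal sketch; the column names are hypothetical, and salted hashing keeps joins across tables intact without exposing raw IDs.

    # Minimal masking sketch for tabular exports; columns are hypothetical.
    import hashlib
    import pandas as pd

    SALT = "replace-with-a-secret-salt"

    def mask(value: str) -> str:
        # same input -> same token, so cross-table joins still work
        return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

    df = pd.read_csv("sap_export.csv")
    for col in ["customer_id", "vendor_id", "employee_id"]:
        if col in df.columns:
            df[col] = df[col].astype(str).map(mask)

    df.to_csv("sap_export_masked.csv", index=False)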

submitted by /u/IamThat_Guy_

USDA Phytochemical Database Enriched With PubMed, ClinicalTrials.gov, ChEMBL, And USPTO Patent Counts — Free Sample Available

Posting a dataset I’ve been building for a while:

What it is: The USDA Dr. Duke’s Phytochemical and Ethnobotanical Databases, restructured into a single flat table and enriched with four external data sources.

Schema (8 columns):

  • chemical — compound name (USDA nomenclature)
  • plant_species — binomial species name
  • application — traditional medicinal use (where recorded)
  • dosage — reported effective dose or concentration
  • pubmed_mentions_2026 — total PubMed publication count
  • clinical_trials_count_2026 — ClinicalTrials.gov study count
  • chembl_bioactivity_count — ChEMBL bioassay data points
  • patent_count_since_2020 — USPTO patents since Jan 2020

Stats: 104,388 records, 24,771 unique compounds, 2,315 species.

Formats: JSON (~18 MB) and Parquet (~900 KB).

Free sample (400 rows, CC BY-NC 4.0): https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

There’s also a quickstart Jupyter notebook in the repo if you want to run some DuckDB queries against the sample.
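
If you’d rather not open the notebook, a quick DuckDB query against the sample might look like this, assuming you’ve saved the sample JSON locally as sample.json:

    # Quick look at the sample with DuckDB; the file name is an assumption,
    # schema per the post.
    import duckdb

    con = duckdb.connect()
    top = con.execute("""
        SELECT chemical,
               COUNT(DISTINCT plant_species) AS n_species,
               MAX(pubmed_mentions_2026)     AS pubmed
        FROM read_json_auto('sample.json')
        GROUP BY chemical
        ORDER BY pubmed DESC
        LIMIT 10
    """).fetchdf()
    print(top)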

The full dataset is commercial (one-time license). The base USDA data is public domain; the enrichment work is what you’re paying for.

I built the dataset solo in Germany, server is a Hetzner VPS running PostgreSQL 15 and Python 3.12. Happy to answer methodology questions.

submitted by /u/DoubleReception2962

Advice On Distributing A Large Conversational Speech Dataset For AI Training?

Hi everyone,

I’m currently involved in a project where we are collecting large volumes of two-speaker conversational call audio intended for AI training purposes (speech recognition, conversational AI, etc.).

We’re trying to understand the best ways to distribute or license this kind of dataset to companies or research teams that need training data.

The recordings are:
• Natural phone-style conversations
• Two participants per recording
• Collected with consent
• PII removed
• Optional transcription and metadata available

I’m curious if anyone here has experience with:

  • selling or licensing speech datasets
  • platforms/marketplaces for AI training data
  • typical pricing per hour of conversational audio

Most information online is very vague, so hearing real experiences from people in the space would be really helpful.

Thanks!

submitted by /u/FaithlessnessWeak199

Edible Plants Of The World: Database

Hi people!

I’d like to share a personal project I’ve been working on, an Edible Plant Database:

Mods, I interpreted your rule as “Self-promotion (of a website/domain you work for or own) without disclosure will be removed”, so I believe this is fine to share, as I am disclosing that I made it. Apologies if I misunderstood the rule. Just to clarify: I make no money from this project; it’s a small hobby, self-hosted database that I never intend to commercialise or monetise in any way, and it will always be free.

Recently, I was searching for some kind of database of edible plants around the world to add to my “prepper” library, and I came across this old post: https://old.reddit.com/r/preppers/comments/iedq94/catalogue_of_all_the_worlds_edible_plants/

Basically, it seemed to be exactly what I was looking for, but it’s a 5-year-old post, and unfortunately, none of the download links worked for me.

The original source is a guy named Bruce French: https://www.abc.net.au/news/2020-08-22/food-plant-solutions-malnutrition-farming-edible-plants/12580732

He still maintains his edible plant database here: https://foodplantsinternational.com/. It’s a fantastic resource; I encourage you to check it out.

The actual searchable database is here: https://fms.cmsvr.com/fmi/webd/Food_Plants_World – however, I was unable to find a bulk download, and the search interface is quite clunky/hard to navigate (I’m sure it was created a long time ago).

So, I decided to create a bit of an ADHD passion project for myself in my spare time. However, it’s got to the point where I thought I should give back to the community.

I decided to take Bruce’s amazing collection and package it in a modern Web UI and a Modern Search interface, so I created this website, The Edible Plant DB: https://edibleplantdb.org/. I’m a bit of an amateur web developer and like playing around with stuff like this in my spare time.

I did, however, decide to make some improvements along the way. Most of Bruce’s collection does have images of the plants; however, they were quite small (basically just thumbnail-sized), and I thought, well, if I’m making a prepper edible plant database, there should be clearer images for people trying to identify the plants. So I updated all the plant images in the database with images sourced from https://www.inaturalist.org/ and Wikipedia. I was able to find images for about 80% of the plants in the DB. But I still need to find images/better descriptions for the niche/uncommon species in the database.

I also went a bit over the top and turned it into a really basic form of a “Wiki”, each plant page has an edit button at the top, so anyone can make an edit, as well as contribute images for each plant (especially for the ones with no images): https://edibleplantdb.org/contribute

Then, in terms of packaging, I am a huge supporter of .ZIM files and the organisation Kiwix: it’s basically everything in one file and much more useful for offline browsing, instead of me just providing a DB file and a bunch of directories/files with images, etc.

You can download the torrent here: https://edibleplantdb.org/downloads – however, just a disclaimer, I literally just started seeding this torrent, so it’s going to be a bit slow, unless I get some support from the community to get the seeding going 🙂

Anyway! Let me know what you think!

PS: Still a work in progress, and I am sure my amateur code has some bugs waiting to be discovered!

Also Magnet link (for ZIM file): magnet:?xt=urn:btih:86cb9bd89b458e75dae4be6281ad5522561f6a8b&dn=edibleplantdb.zim&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce

submitted by /u/tmosh

Structured Normalised Financial Data (Financial Statements, Insider Transactions And 13-F Forms) Straight From The SEC

Hi everyone!

I’ve been working on a project to clean and normalize US equity fundamentals and filings; one thing that always frustrated me was how messy the raw filings from the SEC are.

The underlying data (10-K, 10-Q, 13F, Form 4, etc.) is all publicly available through EDGAR, but the structure can be pretty inconsistent:

  • company-specific XBRL tags
  • missing or restated periods
  • inconsistent naming across filings
  • insider transaction data that’s difficult to parse at scale
  • 13F holdings spread across XML tables with varying structures

I ended up building a small pipeline to normalize some of this data into a consistent format. The dataset currently includes:

  • normalized income statements, balance sheets and cashflow statements
  • institutional holdings from 13F filings
  • insider transactions (Form 4)

All sourced from SEC filings but cleaned so that fields are consistent across companies and periods.

The goal was to make it easier to pull structured data for feature engineering without spending a lot of time wrangling the raw filings.

For example, querying profitability ratios across multiple years:

/profitability-ratios?ticker=AAPL&start=2020&end=2025 

I wrapped it in a small API so it can be used directly in research pipelines or for quick exploration:

https://finqual.app
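
As a usage sketch, calling the example endpoint from Python could look like the snippet below; the base-URL-plus-path pairing and the response shape are assumptions from the example above, not documented API behavior.

    # Hedged usage sketch; endpoint path taken from the example above.
    import requests

    resp = requests.get(
        "https://finqual.app/profitability-ratios",
        params={"ticker": "AAPL", "start": 2020, "end": 2025},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())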

Hopefully people find this useful in their research and signal finding!

Disclaimer: This is a project I built. Sharing it here in case it’s useful for others looking for financial data.

submitted by /u/myztaki

How Do You Handle Data Cleaning Before Analysis? Looking For Feedback On A Workflow I Built

I’ve been working on a mixed-methods research platform, and one thing that kept coming up from users was the pain of cleaning datasets before they could even start analysing them.

Most people were either writing Python/R scripts or doing it manually in Excel, both of which break the workflow when you just want to get to the analysis.

So I built a data cleaning module directly into the analysis tool. It handles the usual stuff:

  • Duplicate removal (exact match or by specific columns)
  • Missing value handling (drop rows, fill with mean/median/mode/custom value, forward/backward fill)
  • Outlier detection (IQR and Z-score methods)
  • String cleaning (trim, case conversion)
  • Type conversion
  • Find & replace (with regex)
  • Row filtering by conditions

Each operation shows a preview with before/after diffs so you can review changes row by row before applying. There’s also inline cell editing for quick manual fixes and one-click undo.
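
For comparison with the script route, here’s roughly what a few of the listed operations look like in pandas; the column names are illustrative.

    # Typical script-based cleaning pass; column names are illustrative.
    import pandas as pd

    df = pd.read_csv("survey.csv")

    df = df.drop_duplicates(subset=["respondent_id"])  # duplicates by key column
    df["age"] = df["age"].fillna(df["age"].median())   # missing-value fill

    # IQR outlier filter on a numeric column
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    df["name"] = df["name"].str.strip().str.title()    # string cleaning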

Curious how others approach this:

  • Do you clean data in a separate tool or prefer it integrated into your analysis workflow?
  • What operations do you find yourself doing most often?
  • Anything obvious I’m missing?

Happy to share a link if anyone wants to try it out. Works with CSV, Excel, and SPSS files.

submitted by /u/Sensitive-Corgi-379