Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Does Anyone Have An Excel-Based Case Study For An Accounting Competition?

Hi everyone!

I know this is a bit of an ask, but I’m helping organize a school competition for undergraduate accounting students, and we’re looking for an Excel-based case study that we could use for the event.

Ideally, it would include:

  • A dataset in Excel that participants can use as raw data
  • Questions or tasks requiring analysis or computations in Excel
  • Topics related to accounting, finance, or business analysis

If possible, it would also help if there’s a sample expected output or reference solution to guide the evaluation.

This is a student-led initiative, so unfortunately we’re unable to provide any compensation, but if anyone has existing Excel case studies, teaching materials, datasets with questions, or knows where we could find something like this, I’d really appreciate the help. We would be very grateful for any materials, resources, or guidance you could share.

Hoping for your kind consideration and thank you so much!

submitted by /u/Noctis-Aeternae

[Free Dataset] 1 Million+ Industrial MRO & Scientific Equipment Metadata (Harvard/Mendeley)

Hi everyone,

I’m sharing a large-scale metadata archive we’ve built at QTE Technologies. It contains over 1,000,000 records of industrial products (MRO) and scientific instruments.

We believe this is a valuable resource for training industrial LLMs and supply chain research.

Access the data here:

License: CC BY 4.0. Looking forward to seeing how the community uses this!

submitted by /u/Heavy_Guitar_7428

[Synthetic][self-promotion] Released A Synthetic Multimodal PHI De-identification Benchmark: Streaming Audit Log With 5 Policy Comparisons

Most PHI datasets evaluate masking on static single-modality documents. This one is different.

It captures per-event masking decisions across a simulated longitudinal stream in which the same subject appears across clinical notes, ASR transcripts, imaging proxies, waveform data, and audio metadata over time. The idea is to evaluate how re-identification risk accumulates across events rather than within a single record.

Five policies are included for comparison: raw, weak, pseudo, redact, and adaptive. The adaptive controller is the interesting one: it escalates masking strength only when cumulative exposure actually justifies it.
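To make the adaptive idea concrete, here is a minimal sketch of a controller that accumulates a per-subject exposure score across events and escalates from weak to pseudo to redact as thresholds are crossed. The weights, thresholds, and function names are hypothetical and are not taken from the benchmark's code or schema.

```python
from collections import defaultdict

# Hypothetical per-modality exposure weights (not from the benchmark).
EXPOSURE_WEIGHT = {
    "clinical_note": 0.4,
    "asr_transcript": 0.3,
    "imaging_proxy": 0.2,
    "waveform": 0.1,
    "audio_metadata": 0.1,
}

# Hypothetical escalation thresholds, checked from strongest to weakest.
THRESHOLDS = [(1.0, "redact"), (0.5, "pseudo"), (0.0, "weak")]


def adaptive_policy(events):
    """Yield (subject, modality, policy) for each event in stream order."""
    cumulative = defaultdict(float)
    for subject, modality in events:
        cumulative[subject] += EXPOSURE_WEIGHT.get(modality, 0.1)
        for floor, policy in THRESHOLDS:
            if cumulative[subject] >= floor:
                yield subject, modality, policy
                break


# Made-up stream: subj-1 appears three times, subj-2 once.
stream = [
    ("subj-1", "clinical_note"),
    ("subj-1", "asr_transcript"),
    ("subj-2", "waveform"),
    ("subj-1", "clinical_note"),
]
decisions = list(adaptive_policy(stream))
```

In this toy stream the masking for subj-1 tightens with each appearance while subj-2 stays at the weakest level, which is the behavior a static per-record policy cannot express.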

Dataset is fully open, no DUA required. Everything runs on synthetic data, no real patient records anywhere.

Hugging Face: https://huggingface.co/datasets/vkatg/streaming-phi-deidentification-benchmark

Code to regenerate: https://github.com/azithteja91/phi-exposure-guard

Happy to answer questions on the schema or the benchmark design.

submitted by /u/Visual_Music_4833

Help Me To Diversify My Research Data

I am doing a research project on the influence of digital financial resources on the financial understanding of young adults aged 18-24, but my data is too male-dominated. Please help me diversify it with responses from women and other genders.

This is for academic purposes and will only take 1 or 2 minutes to fill out.

submitted by /u/Moonandtheearth8

I Built An ESG Data API Covering 500+ Global Companies — Free Tier Available

Hey everyone, I’ve been working on an ESG Data API and just launched it publicly.

It covers 500+ publicly traded companies across the US, Europe, and Asia-Pacific and includes:

  • Overall ESG scores broken down by Environmental, Social, and Governance pillars
  • 3 years of historical ESG data
  • Scope 1, 2, and 3 carbon emissions
  • Sustainability framework disclosures (GRI, SASB, CDP, TCFD)
  • Company screener — filter by ESG score, sector, country

Built it because ESG data is either locked behind expensive Bloomberg/Refinitiv terminals or scattered across inconsistent PDF reports. Wanted to make it accessible for developers, researchers, and fintech builders.

Free tier available. Would love feedback from anyone building in the sustainability or finance space.

Link: https://rapidapi.com/YounesFiali/api/esg-data-api/playground/apiendpoint_7de59263-54c6-4fe7-af0a-5929ec98cee1

Disclaimer: I built this and am the developer behind it. Sharing here because I think it’s useful for the community — happy to answer any questions.

submitted by /u/Choice_Classroom_703

Need E-Commerce Dataset Of At Least 5GB

Hi everyone,

I’m looking for a large e-commerce dataset (at least ~5GB) for a personal data engineering project. Ideally I’m hoping to find something with raw CSV files rather than already processed datasets.

The dataset could include things like:

  • orders
  • customers
  • products
  • order_items
  • payments / transactions
  • reviews or clickstream data (optional but nice to have)

I’m mainly trying to simulate a realistic transactional dataset for building a small data warehouse and running analytics queries.

Requirements:

  • Size: ~5GB or larger
  • Format: CSV preferred
  • Structure: multiple tables
  • Domain: e-commerce / retail

If you know any Kaggle datasets, public data dumps, GitHub repos, or open data sources that match this, please share.

Thanks!

submitted by /u/Historical-Web3638

How Does Your AI Team Source Training Data?

I need a favour from this group.

I’m deep in research on how AI teams actually source and license training data (text, audio, video, synthetic). Not the theory, but the real, messy, day-to-day process.

I’m NOT pitching or selling anything. I’m having short 15-minute conversations with people who work on this daily, and the insights have been genuinely eye-opening.
Happy to share what I’m learning in return.

If you know someone who fits any of these, I’d massively appreciate an intro or a tag in the comments.

Possible targets:

  • ML engineers or data leads at companies training or fine-tuning LLMs
  • Anyone responsible for sourcing or procuring training data
  • Teams building domain-specific AI models (healthcare, legal, finance, speech)
  • People working on multilingual model training

submitted by /u/Winter-Lake-589

[PAID] Everyone’s Posting AI Garbage So I Built Tools To Scrape The Data From It And Give It To You Guys

Spent the last few weeks building scrapers for the major AI tools directories. If everyone’s gonna over-hype this slop, the data should be useful.

What I scraped:

  • Futurepedia: 1,302 tools
  • TAAFT (There’s An AI For That): 6,248 tools
  • TopAI: 1,880 tools
  • MCP Server Directory: 10,614 servers

20,044 entries total. Clean CSVs with categories, pricing, ratings, links, whatever each site had.

Disclosure: this is paid data.

Doing anything with AI tools data? Building something? Just want to poke around? DM me.

submitted by /u/krisco65

Small Favor: Could You Share A Grocery Receipt For A Project I’m Building?

Hi everyone,

I’m working on a small project that tries to read grocery receipts and automatically categorize the items (milk → dairy, apples → produce, etc).
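For anyone curious what the categorization step might look like, here is a minimal sketch that pulls item lines out of receipt text with a regular expression and maps them to categories by keyword. The keyword map, the line pattern, and the sample receipt are all made up for illustration; real receipts vary far more than this pattern allows.

```python
import re

# Hypothetical keyword-to-category map; a real one would be much larger.
CATEGORY_KEYWORDS = {
    "dairy": ["milk", "cheese", "yogurt", "butter"],
    "produce": ["apple", "banana", "lettuce", "tomato"],
    "bakery": ["bread", "bagel"],
}

# Summary lines that look like items but are not.
SKIP_PREFIXES = ("subtotal", "total", "tax")

# Matches "<name> <price>" with a two-decimal price at the end of the line.
ITEM_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})\s*$")


def categorize(name):
    """Return the first category whose keyword appears in the item name."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "uncategorized"


def parse_receipt(text):
    """Return a list of (name, price, category) for each item line."""
    items = []
    for line in text.splitlines():
        match = ITEM_LINE.match(line.strip())
        if not match:
            continue
        name = match.group("name").strip()
        if name.lower().startswith(SKIP_PREFIXES):
            continue
        items.append((name, float(match.group("price")), categorize(name)))
    return items


# Made-up receipt text, not from any real store.
sample = """STORE #1234
GV WHOLE MILK 1GAL   3.48
GALA APPLES          2.15
SUBTOTAL             5.63
"""
items = parse_receipt(sample)
```

Store-specific abbreviations and discount lines are exactly where a fixed pattern like this breaks down, which is why real examples from many retailers matter.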

The surprisingly hard part is that every store prints receipts differently. Walmart, Tesco, Costco, Aldi, and others all have their own formats, abbreviations, tax layouts, loyalty sections, and discount lines.

To make the parser reliable, I need a few real examples of receipts from different stores.

If you happen to have a receipt from one of these stores, it would help a lot if you could share one.

Examples of stores I’m currently looking for include:

US: Walmart, Kroger, Costco, Whole Foods, Target, Publix, Trader Joe’s, Aldi

Canada: Loblaws / No Frills, Costco, Sobeys, Walmart

UK: Tesco, Sainsbury’s, Asda, Aldi, Lidl

Australia: Woolworths, Coles

Singapore: FairPrice / NTUC

Switzerland: Migros, Coop

Japan: Aeon / MaxValu, Ito-Yokado

South Korea: E-Mart, Homeplus

What works best:

• a quick photo of the receipt

• a scanned receipt

• a digital/email receipt

You can blur or crop anything personal like card numbers or addresses. The only parts I really need are:

• the store name/header

• item lines

• prices

• tax/discount sections

Even one receipt helps because each retailer has its own format.

If you’re willing to help, you can:

• post an image here

• DM me

• share an Imgur / Google Drive link

I’d really appreciate it. And once the parser is in good shape, I’m happy to share the dataset and parsing rules with the community as well.

Thanks for helping a nerdy little project learn how to read grocery receipts 🙂

submitted by /u/Sanju-05

Looking For Retail Sales Dataset For A Marketing Data Analysis Project

I am looking for a moderate-to-large dataset containing retail customer order data, some customer demographic data, product details, and reviews if possible. I know there probably isn’t a single dataset that contains all of these in one place, so suggestions on which datasets I can combine, or what to look for, are also welcome. I have already read the posts in this sub on the topic and asked ChatGPT for help, but what it came up with was vague, to say the least. I just want some suggestions on how to approach the dataset side of my project on retail consumer behaviour: I want to analyse how external factors such as trends, weather, and media perceptions contribute to consumer behaviour and sales patterns.

Any suggestions are welcome. Again TIA.

submitted by /u/Su0ma0nt7a

Built A Tool To Generate + QC Custom Datasets For LLM Training (dedupe, Schema Validation, Split Integrity). What Makes You Trust A Dataset?

I’m working on a dataset toolchain aimed at LLM fine-tuning datasets, because I noticed most dataset failures aren’t “model problems”—they’re data problems: duplicates, leakage, unclear labels, inconsistent formatting, or missing documentation.

What the tool enforces

  • Schema validation: every record must match a strict schema (fields, allowed labels, structure)
  • Split integrity: supports splitting by topic/template-family so train/test don’t leak via shared scaffolding
  • Dedupe + repetition control: catches exact and near-duplicates; flags templated collapse
  • QC reports: acceptance rate, failure breakdown, and example-level rejection reasons
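As a rough illustration of two of the checks listed above, the sketch below shows near-duplicate removal using Jaccard similarity over word shingles, and a split assignment that hashes a template-family label so a whole family lands in one split. This is not the tool's actual implementation; the field names `text` and `template_family`, the 0.8 threshold, and the 20% test fraction are assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of lowercase k-word shingles in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(records, threshold=0.8):
    """Keep the first record of each near-duplicate cluster (O(n^2) sketch)."""
    kept, kept_shingles = [], []
    for record in records:
        current = shingles(record["text"])
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(record)
            kept_shingles.append(current)
    return kept


def assign_split(record, test_fraction=0.2):
    """Hash the template family so every record in a family shares a split."""
    digest = hashlib.sha256(record["template_family"].encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "test" if bucket < test_fraction else "train"


# Made-up records for illustration.
records = [
    {"text": "Translate the sentence into French please", "template_family": "translate"},
    {"text": "translate the sentence into french please", "template_family": "translate"},
    {"text": "Summarize the following paragraph in one line", "template_family": "summarize"},
]
clean = dedupe(records)
```

Reporting the similarity measure, the shingle size, and the threshold alongside the dropped examples is one way to let users judge whether useful diversity was removed.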

What I’m trying to get right (and want feedback on)

  • What metadata is a must-have for you? (license, lineage, schema, label definitions, known limitations)
  • Do you prefer datasets shipped as clean-only, or raw + clean + reproducible pipeline?
  • How do you want near-duplicate removal described so you trust it didn’t delete useful diversity?

If people are interested, I can share a dataset-card template + QC report structure that’s been working well (no links unless allowed).

submitted by /u/JayPatel24_

When Did You Realize Standard Scraping Tools Weren’t Enough For Your AI Workloads?

We started out using a mix of low-code scraping tools and browser extensions to supply data for our AI models. That worked well during our proof-of-concept, but now that we’re scaling up, the differences between sources and frequent schema changes are creating big problems down the line.

Our engineers are now spending more time fixing broken pipelines than working with the data itself. We’re considering custom web data extraction, but handling all the maintenance in-house looks overwhelming. Has anyone here fully handed this off to a managed partner like Forage AI or Brightdata?

I’d really like to know how you managed the switch and whether outsourcing your data operations actually freed up your engineers’ time.

submitted by /u/3iraven22

What’s Running Across 350K+ Sites (September 2025 – January 2026)

I’ve been fingerprinting what’s been running on the internet since September, right down to the patch version too. Just chucked a slice of what I’ve found on GitHub.

The schema for the dataset is available in the README file. It’s all JSON files, so you’d be able to easily dig through it using just about any programming language on the planet.

If you find something real cool from this data let me know, I want to see what you can do.

submitted by /u/Upper-Character-6743

Working On A Low-cost Sign Language Recognition System For Hearing-impaired Students — Need Advice On Collecting Datasets

Hi everyone,

I’m a computer science student currently working on a project called SignBridge, an AI-powered accessible learning platform designed to improve classroom communication for hearing-impaired students.

The main goal of the project is to build a lightweight sign language recognition system that can run on low-cost devices (normal laptops without GPUs) so that it could realistically be deployed in schools.

Current approach:

– MediaPipe Holistic for hand + pose landmark extraction

– Landmark normalization

– Random Forest classifier for sign prediction

– FastAPI backend + React frontend

– Real-time webcam input
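The post does not say how the landmark normalization step works, so the following is only one common approach, sketched under assumptions: translate the landmarks so the wrist (index 0 in MediaPipe's hand landmark ordering) is the origin, scale by the largest distance from the wrist, then flatten into a feature row for a classifier. The function names are hypothetical.

```python
import math


def normalize_landmarks(landmarks):
    """Translate so landmark 0 is the origin, then scale to unit size.

    `landmarks` is a list of (x, y, z) tuples; index 0 is assumed to be
    the wrist.
    """
    ox, oy, oz = landmarks[0]
    centered = [(x - ox, y - oy, z - oz) for x, y, z in landmarks]
    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered)
    if scale == 0:
        return centered  # degenerate case: all points coincide
    return [(x / scale, y / scale, z / scale) for x, y, z in centered]


def to_feature_vector(landmarks):
    """Flatten normalized landmarks into one row for a classifier."""
    return [value for point in normalize_landmarks(landmarks) for value in point]
```

Because the features no longer depend on where the hand is in the frame or how close it is to the camera, a small classifier needs fewer examples per sign, which helps when the dataset is small.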

The system currently supports basic word-level sign detection and includes a classroom mode for bidirectional communication:

– Student signs → converted to text

– Teacher speech → converted to live captions

Right now the biggest limitation is dataset size. I only have a small set of labeled sign images/videos, which makes it difficult to expand vocabulary or experiment with temporal models.

I’m looking for advice on a few things:

  1. Datasets for Indian Sign Language (ISL) or similar landmark-based sign datasets.
  2. Best ways to collect a small but useful dataset for word-level or classroom-related signs.
  3. Suggestions for improving the model while keeping it lightweight enough to run on CPU devices.
  4. Any feedback on the system design or architecture.

Eventually I’d like to extend it toward sequential word detection or simple sentence-level interaction, but still keep it deployable on low-resource hardware. Currently this is handled on the React side: as users sign, it stores the sequence of recognized words.

If anyone has worked on sign language recognition, accessibility tools, or dataset collection, I’d really appreciate your suggestions.

Thanks

submitted by /u/Agile_Commission1099

What Metadata Do You Wish Every Dataset Shipped With (so It’s Actually Usable)?

  • I’m packaging a dataset for ML training and want to do this “properly.”
  • What fields make you trust a dataset fast? (license, data lineage, schema, label definitions, splits, leakage checks, etc.)
  • Any examples of dataset cards/docs you consider “gold standard”? (Keep it discussion + best practices; avoid sales. r/datasets discourages low-effort requests and prefers original sources.)

submitted by /u/JayPatel24_

Cleaned JSON Version Of The USDA Phytochemical / Ethnobotanical Database

Hey everyone.
I recently needed to use Dr. Duke’s Phytochemical database for a project, but the raw CSV dumps from the USDA are an absolute nightmare to parse (missing fields, inconsistent naming, random caps lock everywhere).

I spent the last couple of days completely cleaning, normalizing, and mapping the dataset into a relational JSON structure so it’s actually usable for data science pipelines.
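The cleaning code itself is not shown in the post, so the following is only a sketch of the general technique: normalize whitespace and all-caps values, drop rows that are missing a key field, and group flat rows into one nested entry per plant. The column names `plant` and `chemical` and the placeholder rows are assumptions, not the actual USDA schema.

```python
import json


def clean_name(raw):
    """Collapse whitespace and fix ALL-CAPS values; return None for blanks."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    if not text:
        return None
    return text.capitalize() if text.isupper() else text


def to_relational(rows):
    """Group flat (plant, chemical) rows into one JSON-ready entry per plant."""
    plants = {}
    for row in rows:
        plant = clean_name(row.get("plant"))
        chemical = clean_name(row.get("chemical"))
        if plant is None or chemical is None:
            continue  # skip rows missing a key field
        entry = plants.setdefault(plant, {"plant": plant, "chemicals": []})
        if chemical not in entry["chemicals"]:
            entry["chemicals"].append(chemical)
    return list(plants.values())


# Placeholder rows, not real database content.
rows = [
    {"plant": "PLANT  A", "chemical": "CHEMICAL X"},
    {"plant": "Plant a", "chemical": "Chemical y"},
    {"plant": "", "chemical": "Chemical z"},
]
entities = to_relational(rows)
as_json = json.dumps(entities, indent=2)
```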

I put a sample of 400 fully mapped chemical/plant entities on GitHub if anyone else needs this for their research. Saved me a ton of headache.
https://github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

submitted by /u/DoubleReception2962

I Built A Small Experiment To Collect A Longitudinal Dataset Of Gemini’s Stock Predictions

For ~38 days, a cronjob generated daily forecasts:

  • 10-day horizons
  • ~30 predictions/day (different stocks across multiple sectors)
  • Fixed prompt and parameters

Each run logs:

  • Predicted price
  • Natural-language rationale
  • Sentiment
  • Self-reported confidence

Because the runs were captured live, this dataset is time-locked and can’t be recreated retroactively.

### Platform

I built a simple MVP to explore the data interactively:

https://glassballai.com

https://glassballai.com/results

You can browse and crawl all recorded runs here:

https://glassballai.com/dashboard

### Goal

This is not a trading system or financial advice.

The goal is to study how LLMs behave over time under uncertainty:

forecast stability, narrative drift, confidence calibration, and prompt-conditioned bias.
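As one way to look at confidence calibration, the sketch below groups predictions by self-reported confidence and compares each group's hit rate with its confidence level. It assumes confidence is on a 0 to 1 scale and that each prediction has already been scored as a hit or miss; the dataset's actual field names and scale may differ, and the sample records are made up.

```python
from collections import defaultdict


def calibration_table(records, bins=(0.5, 0.7, 0.9)):
    """Group records by confidence bin and report the hit rate per bin.

    Each record is (confidence in [0, 1], hit as bool). A record's bin is
    the largest edge its confidence reaches, or 0.0 if it reaches none.
    """
    grouped = defaultdict(list)
    for confidence, hit in records:
        edge = max((b for b in bins if confidence >= b), default=0.0)
        grouped[edge].append(hit)
    return {
        edge: {"n": len(hits), "hit_rate": sum(hits) / len(hits)}
        for edge, hits in sorted(grouped.items())
    }


# Made-up records for illustration only.
records = [(0.95, True), (0.92, False), (0.75, True), (0.6, False), (0.3, False)]
table = calibration_table(records)
```

A well-calibrated model would show hit rates that rise with the bin edge; a flat or inverted table suggests the self-reported confidence carries little information.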

### Dataset

After ~1.5 months, I’m publishing the full dataset on Hugging Face.

It includes forecasts, rationales, sentiment, and confidence.

(Actual prices are not included because of licensing, but they can be rehydrated.)

https://huggingface.co/datasets/louidev/glassballai

### Stats

Stocks with most trend matches: ADBE (29/38), ISRG (28/39), LULU (28/39)

Stocks with most trend misses: AMGN (31/38), TXN (28/38), PEP (28/39)
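For readers who prefer percentages, this short snippet converts the counts above into rates. The counts are copied from the stats as posted; the helper name `rate` is just for illustration.

```python
# Counts copied from the stats above: (matches or misses, total runs).
trend_matches = {"ADBE": (29, 38), "ISRG": (28, 39), "LULU": (28, 39)}
trend_misses = {"AMGN": (31, 38), "TXN": (28, 38), "PEP": (28, 39)}


def rate(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


match_rates = {ticker: rate(*counts) for ticker, counts in trend_matches.items()}
miss_rates = {ticker: rate(*counts) for ticker, counts in trend_misses.items()}
# match_rates → {'ADBE': 76.3, 'ISRG': 71.8, 'LULU': 71.8}
# miss_rates  → {'AMGN': 81.6, 'TXN': 73.7, 'PEP': 71.8}
```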

Feedback and critique welcome.

submitted by /u/aufgeblobt
