Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Our AI Was Making Up Data For Months And Nobody Caught It, Here’s What I’ve Learned

Came across a post here recently about someone who trusted an AI tool to handle their analytics, only to find out it had been hallucinating metrics and calculations the whole time. No one on their team had the background to spot it, so it went unnoticed until real damage was done.

Honestly, I’ve watched this happen with people I’ve worked with too. The tool gets treated as a source of truth rather than a starting point, and without someone who understands the basics of how the data is being processed, the errors just pile up quietly.

The fix isn’t complicated, and you don’t need a dedicated data scientist. You just need someone who can sanity-check the outputs, understand roughly how the model is arriving at its numbers, and flag when something looks off.
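A minimal sketch of that kind of sanity check, assuming you can export the raw rows behind a number the AI tool reported. The function name, the tolerance, and the sample figures are hypothetical and only illustrate the idea of recomputing a metric independently.

```python
import math

def check_reported_metric(raw_values, reported_mean, rel_tol=0.01):
    """Recompute a mean from raw rows and compare it with the reported figure.

    Returns (matches, recomputed_mean). rel_tol is an assumed tolerance.
    """
    if not raw_values:
        raise ValueError("no raw rows to check against")
    recomputed = sum(raw_values) / len(raw_values)
    matches = math.isclose(recomputed, reported_mean, rel_tol=rel_tol)
    return matches, recomputed

# Hypothetical example: order values exported from the source system,
# checked against an average the tool claimed.
orders = [120.0, 80.0, 100.0, 100.0]
matches, recomputed = check_reported_metric(orders, reported_mean=137.5)
if not matches:
    print(f"Reported 137.5 but raw rows give {recomputed}; flag for review")
```

The point is not the arithmetic itself but that someone on the team reruns a few headline numbers from the underlying data on a regular basis.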

Has anyone here dealt with something like this? Curious how your teams handle AI oversight for anything data-sensitive.

submitted by /u/ansh17091999

SIDD Dataset Question, Trying To Find Validation Subset

Hello everyone!

I am a Master’s student currently working on my dissertation project. As of right now, I am trying to develop a denoising model.

I need to compare the results of my model with other SOTA methods, but I have run into an issue. Lots of papers seem to test on the SIDD dataset. However, I noticed that this dataset is said to be split into a validation subset and a benchmark subset.

I was able to make a submission on Kaggle for the benchmark subset, but I also want to test on the validation dataset. Does anyone know where I can find it? I was not able to find any information about it on their website, but maybe I am missing something.

Thank you so much in advance.

submitted by /u/veganmkup

Causal Ability Injectors – Deterministic Behavioural Override (During Runtime)

I have been spending a lot of time lately trying to stop agents from drifting or getting lost in long loops. While almost everyone just feeds them more text, I wanted to build the rules that actually command how they think. Today, I am open sourcing the Causal Ability Injectors: a way to switch the AI’s mindset in real time based on what’s happening in the flow.

[Example: during a critical question, the input goes through a lightweight RAG node that matches the query style, picks the most confident way of thinking to enforce on the model, and keeps the model on track by prohibiting drift.]

[Integrate as a retrieval step before the agent, OR upsert into your existing doc DB for opportunistic retrieval, OR, best case, add it in an isolated namespace and use it for behavioral constraint retrieval.]

[Data is already graph-augmented and ready for upsertion.]

You can find the registry here: https://huggingface.co/datasets/frankbrsrk/causal-ability-injectors And the source is here: https://github.com/frankbrsrkagentarium/causal-ability-injectors-csv

How it works:

The registry contains specific mindsets, like reasoning for root causes or checking for logic errors. When the agent hits a bottleneck, it pulls the exact injector it needs. I added columns for things like graph instructions, so each row is a command the machine can actually execute. It’s like programming a nervous system instead of just chatting with a bot.

This is the next link in the Architecture of Why. Build it and you will feel how the information moves once you start using it. Please check it out; I am sure it’s going to help if you are building complex RAG systems.

Agentarium | Causal Ability Injectors Walkthrough

1. What this is

Think of this as a blueprint for instructions. It’s structured in rows, so each row is the embedding text you want to match against specific situations. I added columns for logic commands that tell the system exactly how to modify the context.

2. Logic clusters

I grouped these into four domains. Some are for checking errors, some are for analyzing big systems, and others are for ethics or safety. For example, CA001 is for challenging causal claims and CA005 is for red-teaming a plan.

3. How to trigger it

You use the trigger_condition. If the agent is stuck or evaluating a plan, it knows exactly which ability to inject. This keeps the transformer’s attention focused on the right constraint at the right time.

4. Standalone design

I encoded each row to have everything it needs. Each one has a full JSON payload, so you don’t have to look up other files. It’s meant to be portable and easy to drop into a vector DB namespace like causal-abilities.

5. Why it’s valuable

It’s not just the knowledge; it’s the procedures. Instead of a massive 4k-token prompt, you just pull exactly what the AI needs for that one step. It stops the agent from drifting and keeps the reasoning sharp.

It turns AI vibes into adaptive thought through a retrieved, hard-coded instruction set.

State A always pulls Rule B.
Fixed hierarchy resolves every conflict.
Commands the system instead of just adding text.

Repeatable, traceable reasoning that works every single time.
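To make "State A always pulls Rule B" concrete, here is a rough sketch of a deterministic lookup. It swaps the embedding-based retrieval described above for an exact match on trigger_condition to keep the example self-contained. The trigger values, the priority field, and the instruction texts are hypothetical and may not match the real registry's columns; only the IDs CA001 and CA005 and their purposes come from the post.

```python
# Hypothetical rows; the real registry's column names and values may differ.
INJECTORS = [
    {"id": "CA001", "trigger_condition": "causal_claim", "priority": 1,
     "instruction": "Challenge the causal claim before accepting it."},
    {"id": "CA005", "trigger_condition": "evaluating_plan", "priority": 2,
     "instruction": "Red-team the plan and list its failure modes."},
]

def select_injector(active_states, injectors=INJECTORS):
    """Return the single injector for the current states, or None.

    Matching is an exact lookup on trigger_condition, so the same state
    always yields the same rule. If several rows match, the lowest
    priority number wins (ties broken by id), which gives a fixed hierarchy.
    """
    matches = [row for row in injectors
               if row["trigger_condition"] in active_states]
    if not matches:
        return None
    return min(matches, key=lambda row: (row["priority"], row["id"]))

chosen = select_injector({"evaluating_plan", "causal_claim"})
print(chosen["id"], "->", chosen["instruction"])
```

Because the selection is a pure function of the active states, every injected instruction can be logged and replayed, which is what makes the reasoning traceable.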

Take the dataset and use it: just download it and give it to your LLM for analysis.

I designed it for power users, and if you like it, send me some feedback.

This is my work’s broader vision: applying cognition when needed, through my personal attention to data-driven ability.

frank_brsrk

submitted by /u/frank_brsrk

Knowledge Graph Datasets Extracted From FTX Collapse Articles And Giuffre V. Maxwell Depositions

I used sift-kg (an open-source CLI I built) to extract structured knowledge graphs from raw documents. The output includes entities (people, organizations, locations, events), relationships between them, and evidence text linking back to source passages — all extracted automatically via LLM.

Two datasets available:

– FTX Collapse — 9 news articles → 431 entities, 1,201 relations. https://juanceresa.github.io/sift-kg/ftx/graph.html

– Giuffre v. Maxwell — 900-page deposition → 190 entities, 387 relations. https://juanceresa.github.io/sift-kg/epstein/graph.html

Both are available as JSON in the repo. The tool that generated them is free and open source — point it at any document collection and it builds the graph for you: https://github.com/juanceresa/sift-kg

Disclosure: sift-kg is my project — free and open source.

submitted by /u/garagebandj

Dataset: January 2026 Beauty Prices In Singapore — SKU-Level Data By Category, Brand & Product (Sephora + Takashimaya)

I’ve been tracking non-promotional beauty prices across major retailers in Singapore and compiled a January 2026 dataset that might be useful for analysis or projects.

Coverage includes:

  • SKU-level prices (old vs new)
  • Category and subcategory classification
  • Brand and product names
  • Variant / size information
  • Price movement (%) month-to-month
  • Coverage across Sephora and Takashimaya Singapore

The data captures real shelf prices (excluding temporary promotions), so it reflects structural pricing changes rather than sale events.
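For anyone working with old-vs-new price columns like these, the month-to-month movement can be recomputed along the following lines. This is only a sketch: the field names and the sample rows are made up for illustration and are not taken from the dataset.

```python
def price_change_pct(old_price, new_price):
    """Percentage change from old_price to new_price."""
    if old_price <= 0:
        raise ValueError("old_price must be positive")
    return (new_price - old_price) / old_price * 100.0

def average_change_by_category(rows):
    """Mean percentage change per category, unweighted across SKUs."""
    changes = {}
    for row in rows:
        pct = price_change_pct(row["old_price"], row["new_price"])
        changes.setdefault(row["category"], []).append(pct)
    return {cat: sum(vals) / len(vals) for cat, vals in changes.items()}

# Made-up rows, only to show the expected shape of the input.
sample_rows = [
    {"sku": "SKU-1", "category": "Skincare", "old_price": 50.0, "new_price": 55.0},
    {"sku": "SKU-2", "category": "Skincare", "old_price": 80.0, "new_price": 92.0},
    {"sku": "SKU-3", "category": "Fragrance", "old_price": 120.0, "new_price": 108.0},
]
print(average_change_by_category(sample_rows))
```

An unweighted mean treats every SKU equally; weighting by sales volume would need data that the post does not mention.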

Some interesting observations from January:

  • Skincare saw the largest increases (around +12% on average)
  • Luxury brands drove most of the inflation
  • Fragrance gift sets declined after the holiday period
  • Pricing changes were highly concentrated by category

I built this mainly for retail and pricing analysis, but it could also be useful for:

  • consumer price studies
  • retail strategy research
  • brand positioning analysis
  • demand / elasticity modelling
  • data visualization projects

Link in the comment.

submitted by /u/IntelligentHome2342

[self-promotion] Built A Startup Funding Tracker For Founders, Analysts & Investors

Keeping up with startup funding, venture capital rounds, and investor activity across news + databases was taking too much time.

So I built a simple Funding Tracker API that aggregates startup funding data in one place and makes it programmatic.

Useful if you’re:

• tracking competitors

• doing market/VC research

• building fintech or startup tools

• sourcing deals or leads

• monitoring funding trends

Features:

• latest funding rounds

• company + investor search

• funding history

• structured startup/VC data via API

Would love feedback or feature ideas.

https://rapidapi.com/shake-chillies-shake-chillies-default/api/funding-tracker

submitted by /u/Capable_Atmosphere_7

Historical Identity Snapshot/ Infrastructure (46.6M Records / Parquet)

Making a structured professional identity dataset available for research and commercial licensing.

46.6M unique records from the US technology sector. Fields include professional identity, role classification, classified seniority (C-Level through IC), organization, org size, industry, skills, previous employer, and state-level geography.

2.7M executive-level records. Contact enrichment available on a subset.

Deduplicated via DuckDB pipeline, 99.9% consistency rate. Available in Parquet or DuckDB format.

Full data dictionary, compliance documentation, and 1K-record samples available for both tiers.

Use cases: identity resolution, entity linking, career path modeling, organizational graph analysis, market research, BI analytics.

DM for samples and data dictionary.

submitted by /u/Cryptogrowthbox

Ranking The S&P 500 By C-level Turnover

I built a research tool and used it to read filings and press releases for the S&P 500 (502 companies), searching for CEO/CFO departures over the last decade. Sharing it as a resource both for the public data and because the methodology of the tool itself can be applied to any dataset.

Starbucks was actually near the top of the list with 11 C-suite departures. And then there’s a set of companies, including Nvidia and Garmin, which haven’t seen any C-level exec turnover in the last 10 years.

submitted by /u/MathematicianBig2071

Seeking Star Rating Data Sets With Counts, Not Average Score

I have trouble finding data sets of ratings, such as star ratings for movies from 1 to 5 stars, where the data consists of the count for each star. E.g. 1-star: 1 vote, 2-stars: 44 votes, 3-stars: 700 votes, 4-stars: 803 votes, 5-stars: 101 votes. I’m not interested in data sets that only contain the resulting average star score.

It does not need to be star ratings, but data that gives a distribution of the ratings, like absolute category ratings. Could also be probabilities/counts for a set of categories.
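To illustrate why the counts matter more than the average, here is a small sketch using the example votes above. The function name is just for illustration.

```python
def summarize_ratings(counts):
    """Return (total_votes, mean_score, share_per_star) from per-star counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes")
    mean = sum(star * n for star, n in counts.items()) / total
    shares = {star: n / total for star, n in counts.items()}
    return total, mean, shares

# The example distribution from the post.
counts = {1: 1, 2: 44, 3: 700, 4: 803, 5: 101}
total, mean, shares = summarize_ratings(counts)
print(total, round(mean, 2))
for star, share in sorted(shares.items()):
    print(f"{star}-star: {share:.1%}")
```

The mean collapses 1,649 votes into a single number of about 3.58, while the shares show that almost all votes sit on 3 and 4 stars. Very different distributions can share the same mean, which is why the per-star counts are needed.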

Here’s a more scientific example: https://database.mmsp-kn.de/koniq-10k-database.html where people rated perceived image quality of many images on a five point scale.

submitted by /u/hageldave

Help Needed On Health Insurance Carrier Dataset | Consulting Market Research

Hey all, does anyone have suggestions for the most exhaustive, reputable, and usable data sources to understand the entire US health insurance market, to be used in consulting-type market research? That is, a list of all health insurance carriers, the states they cover, member lives, claims volume, types of insurance offered, and funding source. Understandably, there are a lot of half-sources out there. I’ve looked at NAIC, Definitive HC, and other sources but wanted to ‘ask the experts’ here. I know that the top brand names are going to make up 90%+ of the covered lives, but I’m trying to be holistic and exhaustive in my work. Thank you!

submitted by /u/Assignment_Fuzzy

[Self-promotion] R/datasets Is Where It All Started For My Startup

I’m one of the founders of databar.ai. About 4 years ago we posted here with an idea we called a “no-code API marketplace” (link to original post) at the time.

The comments and DMs we got from this community were basically our first real validation, and they pushed us to actually build. This is probably where our journey started.
Since then we’ve pivoted a few times (APIs → connectors → today it’s more of a GTM/data-enrichment product), but the original idea is still the same: make it easier to turn data on the web into clean, usable tables and automations.

Anyways, not here to sell, mostly to say thanks to everyone who supported us! If you share what you’re working on, I’m happy to respond / swap notes.

submitted by /u/Fun-Ant-5808

Looking For High-fidelity Clinical Datasets For Validating A Healthcare Prototype.

Hey everyone,

I’m currently in the dev phase of a system aimed at making healthcare workflows more systematic for frontline workers. The goal is to use AI to handle the “heavy lifting” of data organization to reduce burnout and human error.

I’ve been using synthetic data for the initial build, but I’ve hit the point where I need real-world complexity to test the accuracy of my models. Does anyone have recommendations for high-fidelity, de-identified patient datasets?

I’m specifically looking for data that reflects actual hospital dynamics (vitals, lab timelines, etc.) to see how my prototype holds up against realistic clinical noise. Obviously, I’m only looking for ethically sourced/open-research databases.

Any leads beyond the basic Kaggle sets would be huge. Thanks!

submitted by /u/sylenix

[PAID] Looking For Rights-cleared Datasets For Commercial AI Use

Hey everyone —

I work on data partnerships at Shutterstock and I’m looking to connect with people who own (or represent) datasets that are available for commercial licensing.

This is for paid, legitimate AI training use — not scraping, not academic-only, and nothing with unclear rights.

We’re generally interested in:

  • Speech/audio datasets (multi-language, conversational, accents, etc.)
  • Image or video datasets
  • Domain-specific text/data (healthcare, finance, retail, industrial, etc.)
  • Multimodal datasets with solid metadata

No synthetic datasets.

What matters most:

  • You own the data or have the rights to license it
  • Commercial redistribution is possible
  • It’s meaningful in scale (not small personal projects)

If that’s you, feel free to DM me with a quick overview and we can take it from there. Happy to answer questions here too.

Appreciate it 🙏

submitted by /u/polyphemus12

I/B/E/S Needed For Analyst Coverage Data

Hi, we are two master’s students from Belgium, and while writing our master’s thesis we have run into problems finding analyst coverage data. We have tried Compustat, CRSP, Datastream and Capital IQ. For most of these we can find the data that we need, but we run into access restrictions from our university. This data is absolutely necessary for our thesis, so is there anyone who could share it with us? We would also be very happy to hear about other places we could look and good alternatives! Thanks in advance, 2 desperate students.

submitted by /u/saar309

What Are The Best Value For Money Flight APIs You Know?

Hi! I’m working on building my own flight search engine so I don’t have to spend hours searching manually.

The main advantage is custom filtering that I can’t apply on existing search engines, and I’m already getting results that are better than some of the tools currently on the market.

That said, the more data I can pull, the better the results will be—so I have a couple of questions:

  • What free flight APIs do you know that offer a generous or unlimited request quota?
  • What are the best “bang for the buck” flight APIs you’ve used? (Considering price per request and the size/quality of the data pool.)

Thanks!

submitted by /u/sprinkledino

Using TRAC-1 Or TRAC-2 For Cyberbullying Detection

Hello! I am going to build a model for cyberbullying detection. I was wondering if the TRAC-1 or TRAC-2 datasets would be a good fit for this? Considering that the datasets (I think at least) do not contain cyberbullying labels (i.e., cyberbullying, not cyberbullying), would it be reasonable to treat non-aggressive text as “not cyberbullying” and aggressive text as “cyberbullying”?
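The relabeling being asked about could look roughly like this. The sketch assumes the three-class aggression labels used in the TRAC shared tasks (NAG, CAG, OAG); check the label column of the files you actually download. Treating every aggressive text as cyberbullying is a modelling assumption, not something the datasets themselves claim.

```python
# Assumed TRAC aggression labels: NAG = non-aggressive,
# CAG = covertly aggressive, OAG = overtly aggressive.
AGGRESSION_TO_BINARY = {"NAG": 0, "CAG": 1, "OAG": 1}

def to_binary_label(aggression_label):
    """Map an aggression label to 0 (not cyberbullying) or 1 (cyberbullying)."""
    key = aggression_label.strip().upper()
    if key not in AGGRESSION_TO_BINARY:
        raise ValueError(f"unexpected label: {aggression_label!r}")
    return AGGRESSION_TO_BINARY[key]

# Hypothetical rows, only to show the mapping in use.
rows = [("some neutral comment", "NAG"), ("some hostile comment", "OAG")]
relabeled = [(text, to_binary_label(label)) for text, label in rows]
print(relabeled)
```

Raising on unknown labels, rather than silently defaulting to 0, keeps a mislabeled or differently formatted file from quietly skewing the training set.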

I was also wondering, if the dataset is not a good fit, is there some other well-known dataset I can use? I am also writing a master’s thesis about this, so I cannot use just any dataset.

Any help and tips are appreciated!

submitted by /u/AffectWizard0909

Epstein Graph: 1.3M+ Searchable Documents From DOJ, House Oversight, And Estate Proceedings With AI Entity Extraction

[Disclaimer: I created this project]

I’ve created a comprehensive, searchable database of 1.3 million Epstein-related documents scraped from DOJ Transparency Act releases, House Oversight Committee archives, and estate proceedings.

The dataset includes:
– Full-text search across all documents
– AI-powered entity extraction (238,000+ people identified)
– Document categorization and summarization
– Interactive network graphs showing connections between entities
– Crowdsourced document upload feature

All documents were processed through OpenAI’s batch API for entity extraction and summarization. The site is free to use.

Tech stack: Next.js + Postgres + D3.js for visualizations

Check it out: https://epsteingraph.com

Feedback is appreciated. I would especially be interested in thoughts on how to better showcase this data and correlate various data points. Thank you!

submitted by /u/indienow

[R] SNIC: Synthesized Noise Dataset In RAW + TIFF Formats (6000+ Images, 4 Sensors, 30 Scenes)

[Disclosure: This is my paper and dataset]

I’m sharing my paper and dataset from my Columbia CS master’s project. SNIC (Synthesized Noisy Images using Calibration) provides images with calibrated, synthesized noise in both RAW and TIFF formats. The code and dataset are publicly available.

**Paper:** https://arxiv.org/abs/2512.15905

**Code:** https://github.com/nikbhatt-cu/SNIC

**Dataset:** https://doi.org/10.7910/DVN/SGHDCP

## The Problem

Advanced denoising algorithms need large, high-quality training datasets. Physics-based statistical noise models can generate these at scale, but there’s limited published guidance on proper calibration methods and few published datasets using well-calibrated models.

## What’s Included

This public dataset contains 6000+ images across 30 scenes with noise from 4 camera sensors:

– iPhone 11 Pro (main and telephoto lenses)

– Sony RX100 IV

– Sony A7R III

Each scene includes:

– Full ISO ranges for each sensor

– Both RAW (.DNG) and processed (.TIFF) versions

## Validation

I validated the calibration approach using two metrics:

**Noise realism (LPIPS):** Our calibrated synthetic noise achieves comparable LPIPS to real camera noise across all ISO levels. Manufacturer DNG models show significantly worse performance, especially at high ISO (up to 15× worse LPIPS).

**Denoising performance (PSNR):** I applied NAFNet to denoise real noisy images, SNIC synthesized images, and images synthesized using DNG noise models. Images denoised from our calibrated synthetic noise achieved superior PSNR compared to those from DNG-based synthetic noise.
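For readers unfamiliar with the metric, PSNR can be computed as in the sketch below. This is the generic textbook definition on flat lists of pixel values with an assumed 8-bit peak of 255; it is not the evaluation code from the paper, whose bit depth and channel handling are not described here.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel values: a clean patch and a denoised estimate of it.
clean = [100, 100, 100, 100]
denoised = [101, 99, 100, 100]
print(f"{psnr(clean, denoised):.2f} dB")
```

Higher PSNR means the denoised output is closer to the clean reference, which is the sense in which one result is called superior to another.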

## Why It Matters

SNIC provides both the methodology and dataset for building properly calibrated noise models. The dual RAW/TIFF format enables work at multiple stages of the imaging pipeline. All code and data are publicly available.

Happy to answer questions about the methodology, dataset, or results!

submitted by /u/NikBhatt

Looking For A Dataset Of Healthy Drink Recipes (non-alcoholic/diet-oriented)

Hi everyone! I’m working on a small project and need a dataset specifically for healthy drink recipes. Most of what I’ve found so far is heavily focused on cocktails and alcoholic beverages.

I’m looking for something that covers smoothies, juices, detox drinks, or recipes tailored to specific diets (keto, low-carb, vegan, etc.). Does anyone know of any open-source datasets or APIs that might fit? Thanks in advance!

submitted by /u/danyakrivolap