Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for datasets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

[OC] Open Dataset: Retail BTC Buy Cost Benchmark Across 10 Countries (card/bank Rails, CC-BY-4.0)

I published an open dataset for cross-country retail BTC buy cost benchmarking.

Scope:

– 10 countries

– card and bank rails

– $100 BTC baseline slice

– snapshot-backed benchmark outputs

Core links:

– Report: https://augea.io/reports/retail-crypto-cost-benchmark-2026-q2

– Methodology: https://augea.io/methodology/retail-crypto-cost-benchmark-v1

– Data appendix: https://augea.io/data/reports/retail-crypto-cost-benchmark-2026-q2

Direct files:

– benchmark-pack.json

– claim-gate.json

– country-rail-benchmark.csv

– country-card-vs-bank-delta.csv

License: CC-BY-4.0 (attribution only)

If useful, I can add additional derived slices in the same schema. Feedback on schema/data usability is welcome.
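To illustrate how the benchmark slices could be consumed, here is a minimal sketch of recomputing the card-vs-bank delta from country-rail-benchmark.csv. The column names (`country`, `rail`, `total_cost_usd`) and the inline sample rows are assumptions for illustration; check the data appendix for the real schema.

```python
import csv
import io

# Hypothetical sample mirroring country-rail-benchmark.csv; the real column
# names may differ -- consult the published methodology/appendix first.
SAMPLE = """country,rail,total_cost_usd
US,card,3.42
US,bank,1.10
DE,card,2.95
DE,bank,0.80
"""

def card_vs_bank_delta(rows):
    """Per-country card-minus-bank cost delta on the $100 baseline slice."""
    costs = {}
    for row in rows:
        costs.setdefault(row["country"], {})[row["rail"]] = float(row["total_cost_usd"])
    # Only emit a delta where both rails are present for the country.
    return {c: round(r["card"] - r["bank"], 2)
            for c, r in costs.items() if {"card", "bank"} <= r.keys()}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(card_vs_bank_delta(rows))  # {'US': 2.32, 'DE': 2.15}
```

Something in this shape would also make it easy to sanity-check the published country-card-vs-bank-delta.csv against the raw benchmark rows.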

submitted by /u/pharrison99

Where Can We Find Real-time Banking Transaction Datasets For A Kafka-based Fraud Detection Project?

Hey everyone,

I’m currently doing an internship with a team of 6, and we’re working on a data engineering project focused on big data. The goal is to build a system that processes real-time streaming bank transactions using Kafka, with an added focus on fraud detection and prediction.

Right now, we’re struggling with one main issue: where can we find large-scale, real-time (or realistically simulated) financial transaction data?

Most datasets we’ve found so far are static and not really suitable for real-time streaming or fraud detection scenarios.

If anyone has recommendations—whether it’s datasets, APIs, synthetic data generators, or even approaches to simulate streaming financial data for fraud detection—we’d really appreciate the help.
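On the "simulate streaming data" approach: a small generator is often enough to feed Kafka while you build the pipeline. Below is a sketch of one, with a tunable injected-fraud rate as ground truth; all field names and amount ranges are made up for illustration, and in a real setup you would `json.dumps()` each event into a `KafkaProducer` instead of consuming the generator directly.

```python
import json
import random
from datetime import datetime, timezone

def transaction_stream(seed=42, fraud_rate=0.02):
    """Endlessly yield synthetic bank transactions; ~2% are injected fraud.

    Fraudulent events get conspicuously large amounts here purely so a toy
    detector has something to find -- tune the distributions to taste.
    """
    rng = random.Random(seed)
    while True:
        is_fraud = rng.random() < fraud_rate
        yield {
            "tx_id": rng.getrandbits(64),
            "account": f"ACC{rng.randint(1, 5000):05d}",
            "amount": round(rng.uniform(2000, 9000), 2) if is_fraud
                      else round(rng.uniform(1, 300), 2),
            "ts": datetime.now(timezone.utc).isoformat(),
            "label_fraud": is_fraud,  # ground truth for training, hidden from the detector
        }

stream = transaction_stream()
print(json.dumps(next(stream)))
```

Throttling the generator (e.g. `time.sleep`) gives you a realistic event rate, and replaying a static Kaggle-style fraud dataset row by row through the same producer loop works as a complementary source.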

Thanks in advance!

submitted by /u/No-Big-4463

We Benchmarked 18 LLMs On OCR (7k+ Calls) — Cheaper/Older Models Often Win. Full Dataset + Framework Open-Sourced.

TL;DR: We were overpaying for OCR, so we compared flagship models against cheaper, older ones on a new curated dataset of standard documents you'd find in real-world industry.

We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:

Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran every model 10 times under identical conditions; 7,560 total calls. Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.

We track pass^n (reliability at scale), cost-per-success, latency, and critical field accuracy.
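For readers unfamiliar with pass^n: a minimal sketch of how it and cost-per-success can be estimated, assuming pass^n means the probability that n independent calls all succeed (the harness repo defines the exact metric, so treat these formulas as illustrative).

```python
def pass_n(successes, runs, n):
    """Estimate pass^n as p**n, with p the empirical per-call success rate."""
    p = successes / runs
    return p ** n

def cost_per_success(cost_per_call, successes, runs):
    """Expected spend per successful extraction (hypothetical definition)."""
    return cost_per_call * runs / successes

# A model that passes 9/10 runs looks fine one-off but not over a 20-doc batch:
print(round(pass_n(9, 10, 1), 3))   # 0.9
print(round(pass_n(9, 10, 20), 3))  # 0.122
```

This is why per-call accuracy alone undersells the gap between models at scale: small reliability differences compound exponentially in n.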

All documents are unredacted because the data is synthetic, yet still real-world representative: the information density matches real documents, and only the actual data content is synthetic.

  • Invoices
  • Transport orders
  • Bills of Lading
  • Receipts (from CORU dataset)

Dataset on Hugging Face: https://huggingface.co/datasets/Timokerr/OCR_baseline

Benchmark Harness Repo: https://github.com/ArbitrHq/ocr-mini-bench

Curious whether this matches what others here are seeing.

submitted by /u/TimoKerre

I Do A Lot Of Web Crawling And Put Together A Sample Dataset Of Companies And Their Tech Stacks

I’ve been messing around with web scraping for a while (mostly extracting data on what software websites are running under the hood).

I decided to clean up some of the data and open-source a sample dataset of 500 companies mapped to the tech they use (Stripe, React, Shopify, AWS, etc.). It’s in CSV/JSON.

It’s not a massive dataset by any means, but I figured it might be handy if anyone here needs some real-world data for a side project, practicing pandas/data analysis, or testing out your own scripts without having to build a scraper from scratch.
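As a quick example of the kind of pandas-free analysis the dataset supports, here is a sketch that counts technology adoption across companies. The column names (`company`, `tech_stack`) and the semicolon-delimited stack format are assumptions; check the repo's CSV header for the actual schema.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the shape the repo describes (company -> tech list);
# the real delimiter and column names may differ.
SAMPLE = """company,tech_stack
Acme,"Stripe;React;AWS"
Globex,"Shopify;React"
Initech,"AWS;React"
"""

def tech_frequency(csv_text):
    """Count how many companies use each technology."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.update(t.strip() for t in row["tech_stack"].split(";"))
    return counts

print(tech_frequency(SAMPLE).most_common(2))  # [('React', 3), ('AWS', 2)]
```

Swapping `io.StringIO(csv_text)` for an `open(...)` call on the repo's CSV is all that changes for the real file.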

Repo is here: https://github.com/leadita/tech-stack-datasets

submitted by /u/haynajjar

Network Topology Diagram Datasets For LLMs With Vision Capabilities

Hi, I would like some images of different network topologies, ranging from simple bus topologies to complex real-world networks. Does anyone know of a suitable dataset containing such diagrams?
This is for a project where I will be testing LLMs with vision capabilities for their ability to spot faulty network topologies: perhaps the topology depends on one device not going down, or a server should be moved to a DMZ. Something like that. I appreciate all feedback.

submitted by /u/ThaLazyLand

B2B Lead Dataset – Where To Find It?

Hi all! I'm looking for a dataset with company and employee data that I'd like to use in a small startup, offering the data to people who want to contact those companies and employees. Apollo and all the alternatives don't let you "sell" their info. Do you know of any provider that allows resale? Thank you!

submitted by /u/ghiro12

Most Health Apps Collect Your Data… Is That Really Necessary?

Disclosure: this is a self promotion post.

I’ve been noticing that a lot of health and habit apps require accounts and store personal data in the cloud — even for something as simple as tracking medication.

That feels unnecessary, especially for something so sensitive.

So I built a medication tracker that works completely offline:

  • no login
  • no data collection
  • everything stays on your phone

https://play.google.com/store/apps/details?id=com.vnytalab.carebell

I’m trying to keep it as simple and private as possible.

Would love some honest feedback on this approach — do you actually care about privacy in apps like this, or is convenience more important for you?

submitted by /u/Renpa09

I Built A Synthetic Data Generator, And I’d Love To Get Your Thoughts! [Synthetic]

Hey guys, I'm Adipooj, and over the course of a few months my buddy and I built a synthetic data generator that produces customisable credit card transaction datasets with fraud injected into them, for use in ML/AI training, validation, and, most importantly, model testing!

If this is something that interests you, shoot me a DM, I’d love to send you a sample and get your thoughts on it!

submitted by /u/Adipooj

Definitive Healthcare Datasets (US Healthcare)

I’m looking for US healthcare contact datasets that cover CXOs and IT decision makers. Specifically, I’m interested in records that may include roles like CIO, CTO, VP of IT, Director of IT, CMIO, CEO, COO, and other relevant decision-makers across hospitals, health systems, clinics, medical groups, and related healthcare organizations.

If you have something relevant, please reply or DM with details like coverage, last-updated date, asking price, etc.

submitted by /u/spiritual-stock5469

African Countries: A Curated Dataset On Africa Indicators For Education And Data Science

Initial release of the African Countries Indicators dataset v1.0.0

https://zenodo.org/records/19647480

  • 54 sovereign African nations
  • 10 variables: geographic, demographic, and administrative indicators
  • Formats: CSV and XLSX
  • Sources: World Bank, World Atlas, ISO, Google Developers

submitted by /u/renzocrossi

Offering Agentic SDLC Dataset (full Execution Traces + Code Evolution) In Exchange For Evaluation / Results

I’ve been building a system that generates fully instrumented agentic SDLC traces, and I’m looking for a few serious folks to evaluate it and share results.

Not selling anything here — I’m interested in whether this actually moves model behavior in practice.

What the dataset includes (per “packet”):

  • Full agent execution trace (JSONL audit log)
  • Inline action protocol (custom XML-style commands, also normalized to R1 <|TOOL_CALL|> format)
  • Reinference loops (action → result → next action preserved)
  • Complete project source code
  • Full file evolution history (create/edit/delete with snapshots)
  • SQLite DB with structured tables (runs, tool calls, plans, etc.)
  • Precomputed embeddings (4096d, PII-sanitized)
  • Viewer + ETL tooling to load into your own stack
  • All generated with OSS models w/ verified licenses

Key difference vs typical datasets:
This isn't just prompts → outputs; it's full execution traces with state and file history.

Each project can be iterated:

  • v1: initial build
  • v2: bug fixes
  • v3: polish
  • v4: feature expansion
  • v5: integrations

So you get longitudinal behavior, not isolated samples.
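To give a feel for what consuming the JSONL audit log might look like, here is a sketch that pairs each action with its result to recover the reinference loops. The event schema (`step`, `type`, `tool`, `ok`) is entirely hypothetical; the dataset's own ETL tooling defines the real field names.

```python
import json

# Hypothetical trace lines -- the actual JSONL schema is defined by the
# dataset's viewer/ETL tooling, so treat these fields as illustrative.
TRACE = """\
{"step": 1, "type": "action", "tool": "write_file", "args": {"path": "app.py"}}
{"step": 1, "type": "result", "ok": true}
{"step": 2, "type": "action", "tool": "run_tests", "args": {}}
{"step": 2, "type": "result", "ok": false, "stderr": "2 failed"}
"""

def reinference_loops(jsonl_text):
    """Pair each action with its result, preserving action -> result order."""
    steps = {}
    for line in jsonl_text.splitlines():
        event = json.loads(line)
        steps.setdefault(event["step"], {})[event["type"]] = event
    return [(s["action"]["tool"], s["result"]["ok"])
            for _, s in sorted(steps.items())]

print(reinference_loops(TRACE))  # [('write_file', True), ('run_tests', False)]
```

A loader in this shape is roughly what you'd need to turn the traces into SFT pairs, with failed steps (like the `run_tests` failure above) feeding recovery-behavior examples.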

What I’m looking for:

  • People fine-tuning models (1B–120B, LoRA or full SFT)
  • Agent / tool-use training experiments
  • Anyone doing evals on:
    • tool use correctness
    • code editing / repair
    • multi-step task completion

In exchange:
I’ll provide a dataset bundle (or multiple), and I’m asking for:

  • honest feedback
  • any measurable results (even rough)
  • what worked / didn’t
  • where the data helped or failed

No obligation to share publicly if you don’t want to — even private feedback is useful.

A few things I’m specifically curious about:

  • How much data (tokens) is needed to see behavioral shifts
  • Whether iteration sequences (build → fix → extend) actually help
  • Whether models learn better recovery behavior from failed traces
  • Impact on tool-call correctness / formatting

If you’re interested, comment or DM with:

  • what models you’re working with
  • what you’d want to test

Happy to tailor a dataset slice to your use case.

Would also appreciate any critique on the structure itself — trying to figure out if this is genuinely useful or just interesting.

submitted by /u/madheader69

Need Guidance On Getting Real Brain CT Scan Datasets And Their Reports For A Research-Based Final-Year University Project

I’m a final-year Software Engineering student working on my FYP.

My proposed project is an AI system for detecting abnormalities in brain CT scans (normal, hemorrhage, stroke, edema).

I need some guidance from people in the medical/AI/research field:

  • Where can I get real brain CT scan datasets and their reports?
  • Are there any public datasets or institutions that provide this kind of medical imaging data?
  • What are the main challenges I should expect when working with this kind of data?

If anyone has experience with medical AI, radiology datasets, or hospital collaborations, your advice would really help me shape my project in the right direction.

submitted by /u/Azula691

570 Construction Software Tools Analyzed Across 15 Categories [OC]

I spent six months cataloging every construction software tool I could find and just open-sourced the aggregate data.

15 categories, 570 tools, columns for pricing model, mobile coverage, and company size targeting.

MIT license on the data, CC-BY on the analysis.

Some findings:

  • 55% of vendors hide their pricing behind a sales call. In Safety & Compliance the number climbs to 81%.
  • Only 45% have a mobile app. 83% of bidding tools are desktop-only.
  • 9% target solo operators.
  • 3 categories have zero options for one-person operations: Document Management, Field Management, and Safety & Compliance.

Happy to answer questions about methodology.

Disclosure: I also run ConTechFinder, the directory the data comes from.

submitted by /u/mc_mctools