Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Definitive Healthcare Datasets (US Healthcare)

I’m looking for US healthcare contact datasets that cover CXOs and IT decision makers. Specifically, I’m interested in records that may include roles like CIO, CTO, VP of IT, Director of IT, CMIO, CEO, COO, and other relevant decision-makers across hospitals, health systems, clinics, medical groups, and related healthcare organizations.

If you have something relevant, pls reply or DM with the details like coverage, last updated date, asking price, etc.

submitted by /u/spiritual-stock5469

African Countries: A Curated Dataset On Africa Indicators For Education And Data Science

Initial release of the African Countries Indicators dataset v1.0.0

https://zenodo.org/records/19647480

  • 54 sovereign African nations
  • 10 variables: geographic, demographic, and administrative indicators
  • Formats: CSV and XLSX
  • Sources: World Bank, World Atlas, ISO, Google Developers
  • African Countries Indicators DataSet

submitted by /u/renzocrossi

Offering Agentic SDLC Dataset (full Execution Traces + Code Evolution) In Exchange For Evaluation / Results

I’ve been building a system that generates fully instrumented agentic SDLC traces, and I’m looking for a few serious folks to evaluate it and share results.

Not selling anything here — I’m interested in whether this actually moves model behavior in practice.

What the dataset includes (per “packet”):

  • Full agent execution trace (JSONL audit log)
  • Inline action protocol (custom XML-style commands, also normalized to R1 <|TOOL_CALL|> format)
  • Reinference loops (action → result → next action preserved)
  • Complete project source code
  • Full file evolution history (create/edit/delete with snapshots)
  • SQLite DB with structured tables (runs, tool calls, plans, etc.)
  • Precomputed embeddings (4096d, PII-sanitized)
  • Viewer + ETL tooling to load into your own stack
  • All generated with OSS models w/ verified licenses

Key difference vs typical datasets:
This isn’t just prompts → outputs. It’s:

Each project can be iterated:

  • v1: initial build
  • v2: bug fixes
  • v3: polish
  • v4: feature expansion
  • v5: integrations

So you get longitudinal behavior, not isolated samples.

What I’m looking for:

  • People fine-tuning models (1B–120B, LoRA or full SFT)
  • Agent / tool-use training experiments
  • Anyone doing evals on:
    • tool use correctness
    • code editing / repair
    • multi-step task completion

In exchange:
I’ll provide a dataset bundle (or multiple), and I’m asking for:

  • honest feedback
  • any measurable results (even rough)
  • what worked / didn’t
  • where the data helped or failed

No obligation to share publicly if you don’t want to — even private feedback is useful.

A few things I’m specifically curious about:

  • How much data (tokens) is needed to see behavioral shifts
  • Whether iteration sequences (build → fix → extend) actually help
  • Whether models learn better recovery behavior from failed traces
  • Impact on tool-call correctness / formatting

If you’re interested, comment or DM with:

  • what models you’re working with
  • what you’d want to test

Happy to tailor a dataset slice to your use case.

Would also appreciate any critique on the structure itself — trying to figure out if this is genuinely useful or just interesting.

submitted by /u/madheader69

Need Guidance On Getting Real CT Brain Scan Datasets And Their Reports For Research-Based Final Year University Project

I’m a final-year Software Engineering student working on my FYP.

My proposed project is an AI system for detecting abnormalities in brain CT scans (normal, hemorrhage, stroke, edema).

I need some guidance from people in the medical/AI/research field:

  • Where can I get real CT brain scan datasets?
  • Are there any public datasets or institutions that provide this kind of medical imaging data?
  • What are the main challenges I should expect when working with this kind of data?

If anyone has experience with medical AI, radiology datasets, or hospital collaborations, your advice would really help me shape my project in the right direction.

submitted by /u/Azula691

570 Construction Software Tools Analyzed Across 15 Categories [OC]

I spent six months cataloging every construction software tool I could find and just open-sourced the aggregate data.

15 categories, 570 tools, columns for pricing model, mobile coverage, and company size targeting.

MIT license on the data, CC-BY on the analysis.

Some findings:

  • 55% of vendors hide their pricing behind a sales call. In Safety & Compliance the number climbs to 81%.
  • Only 45% have a mobile app. 83% of bidding tools are desktop-only.
  • 9% target solo operators.
  • 3 categories have zero options for one-person operations: Document Management, Field Management, and Safety & Compliance.

Happy to answer questions about methodology.

Disclosure: I also run ConTechFinder, the directory the data comes from.

submitted by /u/mc_mctools

Looking For Contributors For LLM Response Annotation Dataset (research Project)

I’m a computer science student working on an independent research project studying how large language models respond to different prompt framings.

I’m building a dataset of annotated model responses and looking for a few contributors to help with labeling.

Task:

  • Read short LLM responses (2–5 lines)
  • Assign simple labels (agreement, reasoning quality, etc.)
  • No writing required, just structured selection

Setup:

  • Work is organized in small batches (50–100 samples at a time)
  • Clear rubric and examples provided
  • Focus is on consistency and quality

Contribution options:

  • You can contribute as a research collaborator and be acknowledged as an annotator in the paper
  • Alternatively, if you prefer not to be credited, a small payment per batch can be arranged

If you’re interested, comment and I’ll share a sample + details.

submitted by /u/lembodevil

Free Sample Of My 54K-vehicle Specs Dataset (cars, Trucks, Motorcycles) – Maybe Useful For Someone Here [PAID]

After a year of scraping + PDF parsing, I put together a fairly complete vehicle specs dataset. Sharing a free sample in case anyone here can use it for their work.

– 47,344 cars (108 brands, 1898–2026)

– 5,492 trucks (146 brands, 1960–2024, GVW/GCW, Euro III–VI, axle configs)

– 1,858 motorcycles (171 brands, 1902–2023, suspension/brake/ABS details)

– 40–50 spec fields per vehicle (engine, performance, dimensions, features/equipment, fuel consumption, CO2, price when available)

– CSV + SQL + JSON formats

Free sample (100 cars + 50 trucks + 50 motos, real data, all columns):

https://api.carsdataset.com – click the green “Get Free Sample” button

There’s also a live search/filter demo on the same page if you want to poke around before downloading.

Paid full datasets start at $299 (motorcycles only) up to $999 (complete bundle), quarterly updates included.

r/datasets community: use code `REDDIT20` at checkout for 20% off (or DM me and I’ll send a code directly).

If anyone’s interested in a resellable license to redistribute within your own product (non-exclusive), DM me – happy to chat about scope and pricing.

Questions or data-quality complaints very welcome – I’d rather fix the data than pretend it’s perfect.

submitted by /u/Ok_Cucumber_131

Found Several Major Benchmark Sets With Issues.

tl;dr: did lots of physics and feature extraction on benchmark audio deepfake datasets. The data shows thousands to tens of thousands of clips with incorrect or unreported audio compression that are labeled as uncompressed or ‘clean’ bona fide baselines.

So I ran a massive feature extraction on 20ish industry-standard audio deepfake datasets. One of the more interesting findings was that for a bunch of very common sets like ASVspoof 2021, thousands to tens of thousands of files in their bona fide baseline sets do not match the provided metadata: wideband audio that was actually heavily compressed to narrowband, and audio listed as uncompressed or with no codec applied that looks in the data like it came out of a cheap cellphone.
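To make the wideband-versus-narrowband mismatch concrete, here is a minimal sketch of one way such a file could be flagged. It is not the poster's pipeline. It assumes that audio passed through a narrowband (8 kHz telephone-style) codec carries almost no energy above roughly 4 kHz, and it uses synthetic tones in place of real clips. The function names, the 4 kHz cutoff, and the 0.01 threshold are all illustrative assumptions.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz=4000.0):
    """Fraction of non-DC spectral energy above cutoff_hz, via a naive DFT."""
    n = len(samples)
    total = 0.0
    above = 0.0
    for k in range(1, n // 2 + 1):  # positive-frequency bins, skipping DC
        coeff = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n > cutoff_hz:
            above += power
    return above / total if total > 0.0 else 0.0


def tone(freq_hz, sample_rate, n):
    """Synthetic sine tone standing in for real audio."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


SAMPLE_RATE = 16000  # file claims to be wideband (16 kHz sampling)
N = 256
low_tone = tone(1000, SAMPLE_RATE, N)
high_tone = tone(6000, SAMPLE_RATE, N)

wideband = [a + b for a, b in zip(low_tone, high_tone)]  # energy on both sides of 4 kHz
narrowband = low_tone  # nothing above 4 kHz despite the 16 kHz container

wide_ratio = high_band_energy_ratio(wideband, SAMPLE_RATE)
narrow_ratio = high_band_energy_ratio(narrowband, SAMPLE_RATE)

# Hypothetical rule: a "wideband" file with almost no high-band energy is suspect.
suspect = narrow_ratio < 0.01
```

On real clips you would want an FFT library and averaging over many frames; the naive DFT here only keeps the sketch dependency-free.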

I am not sure what to do with this info :p Would you guys message the dataset authors and suggest a correction to the data? It makes the results of hundreds of papers written under the assumption they were training on properly annotated data suddenly… questionable.

Or am I just full of myself and this kind of undisclosed ‘muddy’ data is fine because ‘AI’?

What would you guys do? File it under cool story bro?

submitted by /u/Wooden_Leek_7258

[Dataset] 150k+ Annotated Stool Images — Available For Research/commercial Licensing

I’ve built what I believe is the largest annotated stool image dataset in existence (~150k+ photos) and I’m exploring whether to license it for research or commercial use. Posting here to gauge interest and get feedback before I decide how to distribute.

What’s in it

  • Size: ~150,000 images (and growing)
  • Source: user submissions via {{iOS/Android consumer app, real-world in-toilet photos}}
  • Resolution: {{typical resolution range, e.g. 1024×1024 up to 4032×3024}}
  • Diversity: {{geographic spread, device/camera variation, lighting conditions, toilet/water conditions}}

Annotations (per image)

  • Bristol Stool Scale (type 1–7)
  • {{color, consistency, volume estimate, blood/mucus flags — list whatever you actually have}}
  • {{any free-text notes, symptoms, or linked user-reported metadata like diet, hydration, medications}}
  • Annotator: {{self-reported by user / reviewed by clinician / AI-assisted + human verified — be honest}}
  • {{Inter-rater agreement or QA process, if any}}

Provenance & compliance

  • Collected under {{Privacy Policy / ToS URL}} with explicit user consent for {{research use / model training}}
  • {{PII stripped: no faces, no identifying EXIF, no filenames containing user IDs}}
  • {{HIPAA status — usually not HIPAA since it’s a consumer app, not a covered entity, but state it clearly}}
  • {{GDPR: EU users’ data handled per … / excluded / anonymized}}
  • Not sourced from clinical/hospital settings — this is consumer-generated, in-the-wild data

What it’s useful for

  • Training classifiers for Bristol scale, blood detection, abnormality flags
  • Gut health / GI apps, telehealth triage, IBD/IBS monitoring research
  • Benchmarking medical vision models on messy, non-clinical imagery

Licensing

  • Open to: {{non-exclusive research license / exclusive commercial license / per-sample pricing / academic free + commercial paid}}
  • Can provide a {{small sample pack, e.g. 500 images}} under NDA for evaluation

DM or comment if interested — happy to answer questions about the schema, provide sample images, or discuss licensing terms.

submitted by /u/SamePersonality5183

50 Years. 9,000 Families. Three Generations Of Family Data. One Very Hard Dataset.

This dataset has tracked the same thousands of American families for 50 years — parents, children, grandchildren. But almost nobody uses it because it is notoriously hard to work with. I wrote a beginner’s guide covering registration, variable selection, FIMS, building person IDs, and exporting a clean CSV. Includes sample Python code. Might be useful if you’ve ever wanted to work with longitudinal family data but didn’t know where to start. Disclosure: I wrote this guide.

https://medium.com/@jfoley648/the-most-interesting-dataset-in-the-world-136946347af2

submitted by /u/Snoo752

[Discussion] A 7-dimension Quality Scoring System For Reasoning Datasets — Methodology + Feedback Wanted

Most dataset quality labels I’ve seen are a single score (accuracy, or “is_valid: true”). After building three reasoning datasets for LLM fine-tuning (legal, clinical, financial) I kept hitting cases where a single score hid the actual problem — e.g., an answer that was factually correct but cited a nonexistent case, or one with perfect citations but a broken reasoning chain.

So I broke quality into 7 dimensions, scored per-example:

  1. Correctness — does the conclusion match ground truth?

  2. Reasoning coherence — does each step follow from the previous?

  3. Citation accuracy — every reference verified against source?

  4. Completeness — are all required fields populated?

  5. Factual grounding — any hallucinated facts?

  6. Consistency — are labels applied the same way across the corpus?

  7. Reproducibility — can the conclusion be re-derived from the rule/inputs alone?

Each dimension gets 0.0–1.0. Final score is the geometric mean (one bad dimension should tank the example, not average out). Low-scoring examples are kept in the corpus but flagged in metadata so downstream users can filter them.
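The aggregation described above can be sketched as follows. This is an illustration only, not the author's implementation: the dimension keys and the `quality_score` function are hypothetical names, and the choice to return 0.0 when any dimension is 0.0 is an assumption that follows from the geometric mean itself.

```python
import math
from typing import Mapping

# Hypothetical keys for the seven dimensions listed in the post.
DIMENSIONS = (
    "correctness",
    "reasoning_coherence",
    "citation_accuracy",
    "completeness",
    "factual_grounding",
    "consistency",
    "reproducibility",
)


def quality_score(scores: Mapping[str, float]) -> float:
    """Geometric mean of the seven per-dimension scores, each in [0.0, 1.0]."""
    values = [scores[name] for name in DIMENSIONS]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each dimension score must be between 0.0 and 1.0")
    if min(values) == 0.0:
        return 0.0  # a zero in any dimension zeroes the product
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One weak dimension (a fabricated citation, say) among otherwise perfect scores.
example = {name: 1.0 for name in DIMENSIONS}
example["citation_accuracy"] = 0.2

geometric = quality_score(example)
arithmetic = sum(example.values()) / len(example)
```

For this example the geometric mean comes out below the arithmetic mean, so the weak dimension costs more than it would under plain averaging, and a dimension scored 0.0 drives the whole example to 0.0.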

What surprised me during scoring:

– ~18% of GPT-4 generated legal analyses had fabricated citations that looked real (wrong year, wrong court, right-ish case name)

– Reasoning coherence and citation accuracy were almost uncorrelated — you can have one without the other

– Consistency (dimension 6) was the hardest to measure and the most valuable once I did — it surfaced a whole class of “label drift” where mid-corpus annotation standards had shifted

Applied to:

– 445 US appellate legal reasoning examples (median score 0.92)

– 493 clinical reasoning traces (median 0.88)

– 1,000 financial routing/classification examples (median 0.94)

Full methodology writeup: https://labelsets.ai/lqs-methodology

Genuinely curious:

– Has anyone tried something similar with more/fewer dimensions?

– Is geometric mean the right aggregation, or does anyone use a weighted model?

– For reasoning datasets specifically, which dimensions are you most suspicious of when evaluating external data before buying/using it?

Happy to go deeper on any dimension in the comments.

submitted by /u/plomii

Hello, Can You Help Me Find An Open Access Dataset For ALS Disease With Any Two Modalities: EHR, EMG, Or Speech

Hi everyone,

I’m currently working on a research project focused on Amyotrophic Lateral Sclerosis (ALS) and I’m trying to build a multi-modal dataset for experimentation.

I’m specifically looking for open-access datasets (or datasets with relatively easy approval) that include any two of the following modalities:

• EHR / clinical data (patient records, ALSFRS scores, demographics, etc.)
• EMG (electromyography signals)
• Speech / voice recordings

So far I’ve explored sources like EverythingALS (speech + patient-reported data) and some EMG datasets on Kaggle, but I’m struggling to find well-structured or commonly used combinations across modalities.

If anyone here has:

  • Links to relevant datasets
  • Suggestions of repositories or research groups sharing data
  • Experience combining datasets for ALS (especially multi-modal setups)

I’d really appreciate your guidance.

Also open to any advice on dataset alignment / fusion strategies if you’ve worked on something similar.

Thanks in advance!

submitted by /u/Hungry-Objective-173

[PAID] Premium B2B Intelligence Datasets — YC Companies, CTO Contacts, Buyer Intent Signals, AI Training Data — Private Deals At Discounted Rates

HSH Intelligence is offering 10 proprietary datasets for immediate private licensing at significantly discounted rates for fast-moving buyers. We are open to negotiation and bundle deals.

What is available:

  1. 5,601 Y Combinator company profiles with verified founder emails, batch, funding, and tech stack
  2. 2,851 CTO and VP Engineering contacts with verified emails and GitHub profiles
  3. 3,151 Shopify store owner profiles with revenue estimates and contact details
  4. 435 recently funded startups with funding amount, round, and investor names
  5. 63,678 buyer intent signals from companies actively evaluating software right now
  6. 150GB AI training instruction response pairs in HuggingFace compatible JSONL format
  7. 1TB SEC Edgar financial filings structured as AI training data
  8. 1GB GitHub code corpus from 6,000 plus repositories across 13 programming languages
  9. 27,000 plus funding news records with latest announcements including CEO and CTO names
  10. 552,039 clean verified B2B contact records enriched with emails, tech stack, and funding signals

Pricing starts from $500 for individual datasets. Bundle deals available at 50 percent off standard market rates. All data delivered within 24 hours in CSV or JSON format. Free 100 row sample available on request before any purchase.

Visit www.hshintelligence.com or DM me directly for samples and pricing!

Disclosure: I am the founder of HSH Intelligence.

Note: All data is sourced exclusively from publicly available sources in the public domain. No private or consent restricted data is included. Full compliance documentation available at www.hshintelligence.com/trust-center

submitted by /u/HealingSunHaven

Full Historical And Real-Time BlueSky Dataset In BigQuery [PAID]

I’ve been maintaining a comprehensive Bluesky dataset in BigQuery and am looking to license access to cover infrastructure costs on a hobby basis. Due to the nature of Bluesky and the underlying ATProto, this includes all posts, follows, likes, etc.

Unfortunately, it’s gotten expensive, and I’m going to have to shut it down if I can’t find a way to reduce the cost.

What’s available:

  • ~11.4 billion raw events
  • Full historical coverage from Bluesky’s launch, backfilled from ATProto CAR file repositories and normalized into a single unified schema
  • Ongoing live stream via Jetstream
  • Raw CAR backfill table also available separately if useful
  • BigQuery-native access, with no ETL on your end

Unpacked tables include:

  • Posts (with hashtags, links, mentions)
  • Likes, reposts, follows, blocks
  • Deletes
  • Profile updates
  • Follower/friend graph materialized views

Who this might be useful for:

  • Researchers studying decentralized social networks, post-Twitter migration, or online discourse
  • Media intelligence / social listening products
  • ATProto developers who want query access to the full event history

Since this is in BigQuery, you can do joins, which leads to all kinds of fun queries like “Give me all the accounts most overfollowed by the unique followers reached by posts mentioning ‘Chartreuse Goose’ for all time.” A query like that would run in 15–30 seconds.
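To show the shape of such a join, here is a toy sketch that uses Python's built-in `sqlite3` in place of BigQuery. The `posts` and `follows` tables, their columns, and the sample rows are all made up for illustration and are not the dataset's actual schema. It also simplifies "overfollowed" to a raw follow count among the reached audience; a real version would presumably normalize against each account's overall follower share.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (author TEXT, text TEXT);        -- hypothetical schema
    CREATE TABLE follows (follower TEXT, followee TEXT);
    """
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [("alice", "I saw a Chartreuse Goose today"), ("bob", "nothing to see here")],
)
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [
        ("carol", "alice"), ("dave", "alice"),
        ("carol", "erin"), ("dave", "erin"),
        ("carol", "bob"), ("frank", "bob"), ("frank", "zed"),
    ],
)

QUERY = """
WITH reached AS (                      -- unique followers of authors who mentioned the phrase
    SELECT DISTINCT f.follower
    FROM posts AS p
    JOIN follows AS f ON f.followee = p.author
    WHERE p.text LIKE ?
)
SELECT f.followee, COUNT(*) AS n       -- who that audience follows, most-followed first
FROM follows AS f
JOIN reached AS r ON r.follower = f.follower
GROUP BY f.followee
ORDER BY n DESC, f.followee
"""

rows = conn.execute(QUERY, ("%Chartreuse Goose%",)).fetchall()
```

Here `rows` ranks the accounts followed by the people who could have seen the matching post. The same two-step structure (find the audience, then aggregate what it follows) would carry over to a BigQuery SQL query against the real tables.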

Also 100% open to releasing to the community if we can find a way to pay for it.

Anyone interested? Not trying to turn a profit here — just trying to keep a resource online. (Hope that’s OK for the rules here!)

submitted by /u/aboothe726

Looking For Early, Unredacted Iraq War Logs

I’m looking for the original Iraq War Diary/Iraq War Logs SQL/CSV dumps from WikiLeaks, circa 2010-2012. More than ten years ago I was reading specific entries for a research project. The incident narratives were fully unredacted. Now, going back to the same entries, WikiLeaks has redacted specifics like unit names and locations, replacing them with “%%%.” That makes the info basically useless for my purposes. Most of the 300,000-ish entries were never crawled by the Wayback Machine, so that’s no good. Harvard’s public Dataverse dataset is the newer scrubbed version, as are the files I’ve seen on GitHub.

Any help is much appreciated. Please feel free to DM me. I’m only looking for about two dozen specific entries, and I can share those reference numbers if that’s easier.

submitted by /u/FelineNursery