Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Where Can I Find Good Time Series Data On Healthcare

I have an assignment on time series and I really want to focus on healthcare since I want to get into health data analysis.

I have looked on websites like the WHO, World Bank Open Data, StatCan, and the UCI Machine Learning Repository, but it’s hard to find data that meets all my requirements. I know this question has been asked before, but I would like some new insights.

submitted by /u/BuddyPersonal433

Nobody Asked But I Organized The FBI NIBRS Dataset (30M+ Records) Into A Searchable Site

Hello everyone reading. I finally got around to publishing a small project I’ve been working on for the past few months.

I was experimenting with the FBI NIBRS dataset and ended up organizing roughly 30 million incident records into Parquet files so they’re easier to query. I used DuckDB on the backend and built a simple site to explore incidents, offenders, and victims without needing to download the raw files.

The original dataset is pretty messy and spread across a lot of tables, so most of the work was figuring out how to structure it and join everything correctly.
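
For a feel of what querying the structured files can look like, here’s a minimal DuckDB sketch; the file and column names are placeholders for illustration, not the repo’s actual schema:

```python
# Minimal sketch: querying NIBRS-style Parquet files with DuckDB.
# File and column names are assumptions, not the repo's actual layout.
import duckdb

con = duckdb.connect()
top_offenses = con.execute("""
    SELECT o.offense_name, COUNT(*) AS n_incidents
    FROM 'incidents.parquet' AS i
    JOIN 'offenses.parquet' AS o ON o.incident_id = i.incident_id
    GROUP BY o.offense_name
    ORDER BY n_incidents DESC
    LIMIT 10
""").fetchdf()
print(top_offenses)
```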

It’s nothing crazy, just something I built out for fun while learning more about data engineering. If anyone has suggestions on improving the schema or query performance I’d definitely like to hear your thoughts.

Repo: https://github.com/that-dog-eater/nibrs-search

submitted by /u/Empty-Individual4835

Dataset: SEC Cyber Incident Disclosures Labeled By Threat Type And Impact

Disclosure: I created and host this dataset.

I compiled a dataset of 80 cybersecurity incident disclosures from SEC filings (primarily 8-K reports) and labeled them using a structured taxonomy.

The goal was to create a more usable dataset for analyzing real-world cyber incidents based on public disclosures.

Dataset includes:

  • Threat type classification (ransomware, data theft, insider, supply chain, etc.)
  • Indicators of business impact (operational disruption, recovery status)
  • Sector categorization (e.g., financial services)
  • Whether cyber insurance was mentioned
  • Source filing references (SEC EDGAR)

Some high-level observations from the dataset:

  • ~72% of cases indicate incomplete recovery or significant disruption
  • 50% involve data theft or exposure
  • Financial services is the most represented sector
  • ~18% mention cyber insurance
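
As a rough illustration, the observations above could be recomputed from the labeled fields with a few lines of pandas; the file and column names below are assumptions, not the dataset’s actual schema:

```python
# Rough sketch of reproducing the high-level observations.
# Column names ("recovery_status", "threat_type", "sector", "mentions_insurance")
# are assumptions for illustration, not the dataset's actual schema.
import pandas as pd

df = pd.read_csv("sec_cyber_incidents.csv")  # hypothetical filename

pct_disrupted = (df["recovery_status"] != "fully_recovered").mean() * 100
pct_data_theft = df["threat_type"].str.contains("data theft|exposure", case=False).mean() * 100
top_sector = df["sector"].value_counts().idxmax()
pct_insurance = df["mentions_insurance"].mean() * 100

print(f"{pct_disrupted:.0f}% incomplete recovery / significant disruption")
print(f"{pct_data_theft:.0f}% data theft or exposure")
print(f"Most represented sector: {top_sector}")
print(f"{pct_insurance:.0f}% mention cyber insurance")
```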

Methodology:

  • Source: SEC EDGAR (8-K incident disclosures)
  • Manual review of each case
  • Consistent tagging using a predefined taxonomy
  • AI used to assist classification consistency (not fully automated)

Limitations:

  • Disclosure quality varies significantly
  • Many filings are intentionally vague
  • Sample size is still relatively small (n=80)

submitted by /u/LordKittyPanther

Built DinoDS — A Modular Dataset Suite For Training Action-oriented AI Assistants (looking For Feedback + Use Cases)

Hey everyone,

I’ve been working on something I’d really appreciate feedback on — DinoDS, a modular training dataset suite for action-oriented AI assistants.

Most datasets today focus on making models better at chatting. But in real products, the harder problem is getting models to behave correctly — deciding what to do, when to retrieve, how to structure outputs, and how to execute workflows reliably.

That’s the gap we’re trying to address.

What DinoDS focuses on:

  • Retrieval vs answer decision-making
  • Structured outputs (JSON, tool calls, etc.)
  • Multi-step agent workflows
  • Memory + context handling
  • Connectors / deep links / action routing

So instead of just improving how a model sounds, DinoDS is built to improve how it acts inside real systems.

We’re currently building this as a modular dataset suite that teams can plug into their training / eval pipelines.
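
To make the “behavior over chat” idea concrete, here is a hypothetical shape for a single training example covering the retrieval-vs-answer decision and a structured tool call; the field names are illustrative, not DinoDS’s actual format:

```python
# Hypothetical shape of one training example for action-oriented behavior.
# Field names, tool names, and schema are illustrative assumptions only.
example = {
    "messages": [
        {"role": "user", "content": "What were our refund totals last quarter?"},
    ],
    "expected_action": {
        "type": "tool_call",                 # vs. "answer" when no retrieval is needed
        "tool": "query_billing_db",          # hypothetical connector name
        "arguments": {"metric": "refunds", "period": "last_quarter"},
    },
    "eval": {
        "must_retrieve": True,               # the model should not answer from memory
        "output_schema": "json",             # response must validate against a schema
    },
}
```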

Would love feedback on:

  • What use cases this could be most valuable for
  • Gaps we might be missing
  • How teams here are currently handling behavioral / agent training
  • What would make something like this actually useful in production

Also open to connecting with anyone working on similar problems or looking for this kind of data.

Check it out: https://dinodsai.com/

Cheers 🙌

submitted by /u/JayPatel24_

[Dataset] Most-searched Firewood Species In Every U.S. State, Cross-referenced With BTU Heat Output — 50 States, 17 Species, Free CSV

Collected Google Trends data for 17 firewood species across all 50 states over a 12-month period (March 2025–March 2026), using oak as a consistent anchor term across 4 batches to normalize relative scores.

Then cross-referenced each state’s top species against published BTU heat output ratings from Penn State Extension and USDA Forest Service.
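
For anyone wanting to reproduce the normalization step, here is a minimal pandas sketch of the anchor-term idea, assuming a raw Trends pull with hypothetical "batch_id", "species", and "score" columns:

```python
# Sketch of anchor-term normalization: Google Trends scores are only comparable
# within a batch, so each batch is rescaled so the shared anchor ("oak") is 100.
# Column names are assumptions about the raw pull, not the published CSV.
import pandas as pd

def normalize_batches(df: pd.DataFrame, anchor: str = "oak") -> pd.DataFrame:
    out = []
    for _, batch in df.groupby("batch_id"):
        anchor_score = batch.loc[batch["species"] == anchor, "score"].mean()
        out.append(batch.assign(normalized=batch["score"] / anchor_score * 100))
    return pd.concat(out, ignore_index=True)
```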

Key findings:

  • Oak dominates 35+ states — and it’s the right call at 26.4M BTU/cord
  • Idaho and Montana search for pine above everything else — 35% less heat per cord than oak
  • New Mexico’s piñon pine preference is actually thermally defensible at 24.7M BTU/cord
  • Alaska leads with birch — smart given what’s harvestable there

Dataset fields: State, top species, relative search score, 2nd place species, 2nd place score, BTU output, heat efficiency rating

Downloads:

License: CC BY 4.0 — free to use with attribution.

submitted by /u/Klutzy_Pressurez

Looking For Data Sources For AI & Data Governance Research

Dear data community,

I am a researcher currently looking for datasets and inspiration for my work. My research focuses on AI agents within organizations, and my goal is to develop a system where agents can oversee data pipelines, generate lineage, and propose improvements.

Ideally, I am looking for datasets that are either raw or require cleaning, so they can better support data governance use cases (e.g., defining ER models, data quality rules, lineage, etc.). One idea I explored was using data from crypto exchanges, since they are freely available. However, these datasets are typically already well-structured, require minimal cleaning, and do not easily lend themselves to modeling complex governance scenarios (e.g., ER modeling, data ownership, data quality issues).

Additionally, I would like to build a simple machine learning component on top of the data, mainly for completeness and demonstration purposes.

That said, I am finding it quite challenging to identify “realistic” and sufficiently complex datasets that meet these criteria.

I would greatly appreciate any suggestions or pointers to relevant data sources.

submitted by /u/Vegetable_Fishing

Building Per-asset LoRA Adapters For Financial News Sentiment — Which Training Path Would You Prefer?

IMPORTANT: when I say “which one would YOU prefer”, I mean it, because I’m building this not only for myself.
There must be people out there running into the same problem. If you are one of them, which one would make you smile?

I’ve been building a community labeling platform for financial news sentiment — one label per asset, not generic.
The idea is that “OPEC increases production” is bearish for oil, but FinBERT calls it bullish because it sees words like “increases” and “production.”
I needed asset-specific labels for my personal project and couldn’t find any, so I set out to build them and see who is interested.

I now have ~46,000 labeled headlines across 27 securities (OIL, BTC, ETH, EURUSD, GOLD, etc.), generated by Claude Haiku with per-asset context.
Human validation is ongoing (only me so far, but I am recruiting friends). I’m calling this v0.1.

I want to train LoRA adapters on top of FinBERT, one per security, 4-class classification (bullish / bearish / neutral / irrelevant).
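
Whichever path wins, the training code itself is short. Here is a minimal sketch with transformers + peft, assuming the ProsusAI/finbert checkpoint and illustrative, untuned hyperparameters:

```python
# Minimal sketch of one per-asset LoRA adapter on FinBERT.
# Checkpoint name, rank, and target modules are assumptions, not tuned choices.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=4,                     # bullish / bearish / neutral / irrelevant
    ignore_mismatched_sizes=True,     # FinBERT ships a 3-class head
)

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```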

Three paths I’m considering:

  1. HuggingFace Spaces (free T4): run training directly on HF infrastructure. Free, stays in the ecosystem. I’ve only used Spaces for inference, never for training.
  2. Spot GPU (~$3 total): Lambda Labs or Vast.ai, SSH in, run the script, done in 30 min per adapter. Clean, but it requires spinning something up and will cost me some gold coins.
  3. Publish datasets only for now: just push the JSONL files to HF as datasets and write model card stubs with “weights coming.” Labeling data is the hard part; training is mechanical. v0.1 = the data itself. But that is what I built it for, isn’t it?

My instinct is option 3 first, then spot GPU for the weights. But curious what people here would do — especially if you’ve trained on HF Spaces before.

Project: <ask me> — contributions welcome if you want to label headlines.

If you’re working on something similar, drop a comment — happy to share the export pipeline.

submitted by /u/Poli-Bert

Does Anyone Have A Good RIR Mega Dataset In The Audio ML Space? [Synthetic]

Came across this dataset paper that I think deserves more attention.

RIR-Mega is a large-scale collection of simulated Room Impulse Responses (RIRs) designed specifically for ML workflows. What makes it stand out from older RIR datasets:

  • 50,000 RIRs with a clean, flat Parquet metadata schema (RT60, DRR, C50, C80, band RT60s)
  • Three evaluation splits: random, unseen_room, and unseen_distance — so you can actually test generalization

The HF dataset is at: https://huggingface.co/datasets/mandipgoswami/rirmega
Paper: https://arxiv.org/abs/2510.18917
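
A quick way to poke at the metadata with the datasets library; the split and column names are not confirmed, just what the description above suggests:

```python
# Inspect RIR-Mega metadata, assuming it loads with the standard datasets API.
# Split names and fields (RT60, DRR, C50, C80) are assumptions from the description.
from datasets import load_dataset

ds = load_dataset("mandipgoswami/rirmega")
print(ds)                                  # shows available splits and row counts
first_split = next(iter(ds.values()))
print(first_split.column_names)            # expect acoustic-parameter columns
print(first_split[0])                      # one record's metadata
```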

Has anyone used this for dereverberation or acoustic parameter estimation? Curious how it holds up against BUT-ReverbDB or OpenRIR for downstream ASR robustness tasks.

submitted by /u/Stellar_Bluebird

Anime Revenue In CSV / Excel Spreadsheet

Hi everyone, I’m doing a project for which I need a dataset in CSV or Excel spreadsheet format on anime revenue: streaming, TV, merchandise, DVD, events, etc. I tried searching online but could not find any. Are there any sources where I can find such data?

submitted by /u/Darclo12

Project Partner Buddy To Do DA Portfolio Projects

Hello guys, I am an aspiring Data Analyst. I know tools like SQL, Excel, Power BI, and Tableau, and I want to create portfolio projects. I tried doing it alone but kept getting distracted, or just took everything from AI in the name of help! So I was thinking someone could be my project partner and we could create portfolio projects together. I am not a very proficient data analyst, I am just a fresher, so I want someone with whom we can really help each other out, create portfolio projects, and add weight to our resumes!

submitted by /u/Substantial_Edge3588

Per-asset LoRA Adapters For Financial News Sentiment — Dataset Pipeline, Labeling Methodology, And What’s Going Up On HuggingFace

Where are the domain-specific LoRA fine-tunes for financial sentiment analysis — one adapter per asset (OIL, GOLD, COFFEE, BTC, EUR/USD, etc.)?

The problem: no labeled dataset exists that’s asset-specific. Generic FinBERT doesn’t know that “OPEC cuts production” is bearish for oil. So I built one.

The pipeline:

~17,500 headlines collected across 35+ securities from RSS, Google News, GDELT, YouTube transcripts, and FMP.

Claude Haiku pre-labels everything with asset-specific context (known inversions, price drivers). Humans review and override.

Why per-asset matters:

Because standard sentiment models like FinBERT treat “Fed raises rates” as bearish across the board, regardless of the asset.

Take “rising dollar boosts USD index to 3-month high”:
• FinBERT: bullish
• Actual gold market: bearish

Or “OPEC increases production”: is that good news for your OIL futures?
• FinBERT sees “increases” and “production” → bullish (more output = growth = good)
• Actual oil market → bearish (more supply = price drops)

Labeling methodology:

• 4 classes: bullish / bearish / neutral / irrelevant (per asset, not generic)
• AI seed labels → human consensus → LoRA training data
• Target: ~500 human consensus labels per security before fine-tuning
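
For concreteness, here is a hypothetical record after the pipeline runs; the field names are illustrative, not the actual export format:

```python
# Hypothetical shape of one labeled headline; field names are assumptions.
record = {
    "headline": "OPEC increases production amid rising demand",
    "asset": "OIL",
    "ai_seed_label": "bearish",        # Claude Haiku pre-label with per-asset context
    "human_labels": ["bearish", "bearish"],
    "consensus": "bearish",            # more supply -> lower prices for oil futures
    "classes": ["bullish", "bearish", "neutral", "irrelevant"],
}
```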

What’s going up on HuggingFace:

• Inversion catalog already live: polibert/sentimentwiki-catalog
• Labeled dataset + LoRA adapters: uploading as each security hits threshold
• First uploads: OIL, GOLD, EUR/USD (most labeled)

Data sources that actually work (and a few that don’t):

Works: OilPrice RSS, FXStreet, CoinDesk, GDELT, YouTube (Bloomberg/Reuters/Kitco), FMP (the only paid one)
Doesn’t work: S&P Global Platts (paywalled), USDA AMS (PDFs only), ICO coffee (Cloudflare-blocked)

If you work in financial NLP and want to contribute labels or suggest assets: sentimentwiki.io — contributions welcome

submitted by /u/Poli-Bert

How Do You Search Violations In Bulk In The NOLA OneStop App?

I’m trying to look up multiple property violations at once using the NOLA OneStop website/app, but I can’t find a way to run a bulk search. Right now it seems like I have to check each address individually. Is there a way to search or export violations in bulk (for multiple addresses or properties) on NOLA OneStop? Or is there another tool or dataset people use for this?

submitted by /u/tshuntln1

I Created A Dataset To Make RAG Training Easy.

The more diversity that can be shared at this level, the easier it will be for independent developers to continue to help push the frontiers of what is possible in LLM development.

This dataset is free to use in your projects. Please upvote. Your support means a lot!

Contains 312,000 records for training subject/question/answer classification with consistent behavior, built from Wikipedia while retaining source link structure. Ideal for NLP RAG / TriviaQA-style benchmarks.

https://huggingface.co/datasets/CJJones/Wikipedia_RAG_QA_Classification
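
A minimal loading sketch; the record fields are a guess from the description (subject / question / answer), not a confirmed schema:

```python
# Load and inspect the dataset; field names below are assumptions, not confirmed.
from datasets import load_dataset

ds = load_dataset("CJJones/Wikipedia_RAG_QA_Classification")
print(ds)                                  # splits and record counts
print(next(iter(ds.values()))[0])          # inspect one record's fields
```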

submitted by /u/No-Cash-9530

[Showcase] Structuring 2,170+ TCM Herbs Into JSON: Challenges In Data Normalization

Hi everyone, I’ve spent the last few months digitizing and structuring a database of 2,170+ traditional medicinal herbs. The biggest challenge wasn’t just translation, but mapping biochemical compounds (like Astragaloside IV) to qualitative properties (Nature/Taste) in a way that modern systems can process.

Technical Breakdown:

  • Nomenclature: Cross-referenced English, Latin, and Hanzi.
  • Safety Data: Structured toxicity levels and contraindications.
  • Structure: Validated JSON, optimized for knowledge graphs.
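
For anyone curious about the record shape, here is a hypothetical example of how one herb might be represented; field names and values are illustrative, not taken from the actual dataset:

```python
# Hypothetical record shape for one herb, illustrating the structure described above.
herb = {
    "names": {"english": "Astragalus root", "latin": "Astragali Radix", "hanzi": "黄芪"},
    "properties": {"nature": "warm", "taste": ["sweet"]},
    "compounds": [{"name": "Astragaloside IV", "class": "saponin"}],
    "safety": {"toxicity": "low", "contraindications": ["acute infection"]},
    "functions": ["tonifies qi"],
}
```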

I’ve put together a substantive summary and a 50-herb sample for anyone interested in the data schema or herbal research. If anyone wants the documentation and the sample file, just message me; it’s free.

I’d love to get your thoughts on the schema design, especially regarding the mapping of chemical compounds to therapeutic functions.

submitted by /u/Desperate_Spirit_576

How To Split A Dataset Into 2 To Check For Generalization Over Memorization?

I want to make sure a neural network is generalizing rather than memorizing.

Given that the one dataset I have is a collection of social media chats, would it be sufficient to split it chronologically to create the two datasets?

Or does something more need to be done, like also splitting by the usernames and channel names being mentioned?

Basically, I only have one dataset, but I want to make two datasets out of it: one for supervised training of the model and the other for checking how well the model performs.
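
A chronological split guards against leakage from the future, while a group split (by username or channel) checks whether the model generalizes to unseen users rather than memorizing them. A rough sketch of both, assuming hypothetical "timestamp" and "username" columns:

```python
# Two ways to split one chat dataset; column names are assumptions.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("chats.csv")  # hypothetical file

# 1) Chronological split: train on the past, evaluate on the future.
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train_time, test_time = df.iloc[:cutoff], df.iloc[cutoff:]

# 2) Group split: each username lands entirely on one side, so the model
#    can't just memorize user-specific patterns.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["username"]))
train_user, test_user = df.iloc[train_idx], df.iloc[test_idx]
```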

submitted by /u/Calm_Maybe_4639

My Friend Didn’t Know There Was A Simpler Way To Clean A CSV. So I Built One.

A few months ago I was sitting with my friend who’s doing his data science degree. He had a CSV file, maybe 500 rows, and just needed to clean it before running his model: remove duplicates, fix some inconsistent date formats, that kind of thing.

He opened Power BI because that’s genuinely what his college taught him. It worked, but it took 20 minutes for something that felt like it should take 2.

I realized the problem wasn’t him, there just aren’t many tools that sit between “write pandas code” and “open a full BI suite” for basic data cleaning. That gap is what I wanted to fill.
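
For reference, the “write pandas code” end of that spectrum for the exact cleanup described above might look like this (file and column names are assumptions):

```python
# The pandas equivalent of the cleanup described above: drop duplicates,
# coerce mixed date formats, drop fully empty columns. Names are assumptions.
import pandas as pd

df = pd.read_csv("survey.csv")
df = df.drop_duplicates()
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # unparseable values become NaT
df = df.dropna(axis=1, how="all")                          # drop fully empty columns
df.to_csv("survey_clean.csv", index=False)
```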

So I built DatumInt. Drop in a CSV or Excel file, it runs entirely in your browser, nothing goes to a server.

It auto-detects what’s wrong – duplicates, encoding issues, messy date formats, empty columns – gives you a health score and fixes everything in one click.

No code. No heavy software. No signup. Still early and actively improving it.

Curious what data quality issues you hit most often – what would make a tool like this actually useful to you?

(Disclosure: I’m the developer of this tool)

submitted by /u/PriorNervous1031

Best Dataset For A First Excel Portfolio Project?

Hi everyone
I’m self-teaching data analytics and just wrapped up my Excel training. Before diving into SQL, I want to build a solid, hands-on project to serve as my very first portfolio piece and my first professional LinkedIn post. I want to build something that stands out to hiring managers and has a long-lasting, evergreen appeal. What datasets do you highly recommend for someone aiming for a data or financial analysis role? Are there specific datasets—like sales, finance, or operations—that never go out of style and perfectly showcase data cleaning, complex formulas, and dashboarding? I’d love your advice on where to find the best fit for a strong, impactful first project!

Thanks in advance

submitted by /u/Living-Bass1565

Extracting Structured Datasets From Public-record Websites

A lot of public-record sites contain useful people data (phones, address history, relatives), but the data is locked inside messy HTML pages.

I experimented with building a pipeline that extracts those pages and converts them into structured fields automatically.

The interesting part wasn’t scraping — it was normalizing inconsistent formats across records.
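
As a tiny example of that normalization problem, phone numbers alone show up in a handful of formats; here is a purely illustrative sketch of collapsing them to one canonical form:

```python
# Illustrative normalization: map US phone numbers in mixed formats to XXX-XXX-XXXX.
import re

def normalize_phone(raw: str) -> str | None:
    """Return a 10-digit US number as XXX-XXX-XXXX, or None if it can't be parsed."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(normalize_phone("(504) 555-0123"))   # 504-555-0123
print(normalize_phone("+1 504.555.0123"))  # 504-555-0123
```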

Curious if anyone else here builds pipelines for turning messy web sources into structured datasets.

https://bgcheck.vercel.app/

submitted by /u/Aggressive_Cut7433

Open-source Tool For Schema-driven Synthetic Data Generation For Testing Data Pipelines

Testing data pipelines with realistic data is something I’ve struggled with in several projects. In many environments, we can’t use production data because of privacy constraints, and small handcrafted datasets rarely capture the complexity of real schemas (relationships, constraints, distributions, etc.).

I’ve been experimenting with a schema-driven approach to synthetic data generation and wanted to get feedback from others working on data engineering systems.

The idea is to treat the **schema as the source of truth** and attach generation rules to it. From that, you can generate datasets that mirror the structure of production systems while remaining reproducible.

Some of the design ideas I’ve been exploring:

• define tables, columns, and relationships in a schema definition
• attach generation rules per column (faker, uuid, sequence, range, weighted choices, etc.)
• validate schemas before generating data
• generate datasets with a run manifest that records configuration and schema version
• track lineage so datasets can be reproduced later
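
To make the idea concrete, here is a minimal sketch of schema-driven generation with Faker; the rule names and structure are illustrative, not data-forge’s actual config format:

```python
# Minimal sketch of schema-as-source-of-truth generation. The schema format and
# rule names are assumptions for illustration, not the tool's actual API.
import random
from faker import Faker

fake = Faker()
Faker.seed(42)          # seeded for reproducibility, as the run-manifest idea suggests
random.seed(42)

schema = {
    "customers": {
        "rows": 100,
        "columns": {
            "customer_id": lambda i: i,                       # sequence rule
            "name": lambda i: fake.name(),                    # faker rule
            "plan": lambda i: random.choices(
                ["free", "pro", "enterprise"], weights=[70, 25, 5]
            )[0],                                             # weighted choice rule
        },
    },
}

def generate(schema: dict) -> dict[str, list[dict]]:
    """Materialize each table by applying its column rules row by row."""
    return {
        table: [
            {col: rule(i) for col, rule in spec["columns"].items()}
            for i in range(spec["rows"])
        ]
        for table, spec in schema.items()
    }

data = generate(schema)
print(data["customers"][:3])
```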

I built a small open-source tool around this idea while experimenting with the approach.

Tech stack is fairly straightforward:

Python (FastAPI) for the backend and a small React/Next.js UI for editing schemas and running generation jobs.

If you’ve worked on similar problems, I’m curious about a few things:

• How do you currently generate realistic test data for pipelines?
• Do you rely on anonymised production data, synthetic data, or fixtures?
• What features would you expect from a synthetic data tool used in data engineering workflows?

Repo for reference if anyone wants to look at the implementation: https://github.com/ojasshukla01/data-forge

submitted by /u/Business-Quantity-15