Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn't interesting; I'm interested. Maybe they know where the chix are. But what do they need it for? World domination?

I Want To Build A Website About The History Of Club Atlético Independiente (21st Century) — How Do I Get My Excel Data Onto A Website?

Hi, I have a project where I'd like to build a website about the history of Independiente (ideally the whole history, but for now just the 21st century). For example, Lanús has one that is really good; it's called museogranate.clublanus.

I'd also like to add every match and the lineups for each match, plus as much information as possible for each one (Independiente's lineup and the rival's, yellow cards, red cards, goals, assists, and substitutions).

As an extra, I was also thinking of building the standings for every tournament of the 21st century, with the ability to see what the table looked like on a given matchday. For example, I want to see the Apertura 2010 standings at matchday 9, along with all the matches played that matchday and the goals with their respective minutes.

I have all of this recorded in an Excel file, but I don't know how to turn it into a web page. I don't have the programming skills needed, but I can learn. What do you recommend?
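One low-effort path, if learning a bit of Python is on the table, is to keep the Excel file as the source of truth and export it to JSON that a static web page can load. A minimal sketch, assuming a sheet named "partidos" with one row per match (the file and column names here are placeholders, not anything from the post):

```python
import pandas as pd

# Placeholder file and sheet names: adjust to the real workbook.
matches = pd.read_excel("independiente.xlsx", sheet_name="partidos")

# Export a JSON file that a static HTML/JavaScript page can fetch and render as tables.
matches.to_json("partidos.json", orient="records", force_ascii=False, indent=2)
```

From there, a single HTML page with a bit of JavaScript, or a static-site generator, can read partidos.json and render match tables and standings without any server-side code.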

submitted by /u/Few-Replacement-6351
[link] [comments]

[Synthetic][PAID][self-promotion] Made-to-order Training Data Generator With Web Search And Exports

Disclosure: I’m on the Abliteration team.

We just shipped a training-data generator for people who need specific examples rather than another generic public dataset.

You describe the examples you want and it generates structured synthetic data. If the dataset needs current or real-world facts, you can turn on web search. Exports are live for Hugging Face, Kaggle, S3, and OpenAI.

The first use cases we built around are classifier and eval datasets for trust and safety: grooming detection, harassment detection, security research evals, jailbreak and edge-case sets, and similar work where teams need examples that general-purpose models often refuse to generate.

I marked this as synthetic and paid because the outputs are generated and this is a commercial tool.

Product: https://abliteration.ai/

Synthetic data page: https://abliteration.ai/use-cases/synthetic-data

Launch video: https://x.com/abliteration_ai/status/2054675554138194178

For people who curate datasets: what export format or per-row provenance metadata do you usually need before a generated dataset is usable?
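To make the question concrete, here is a hedged sketch of the kind of per-row provenance a JSONL export could carry. This is a hypothetical schema for discussion, not Abliteration's actual export format:

```python
import json, hashlib
from datetime import datetime, timezone

# Hypothetical provenance fields, not the product's real schema.
row = {
    "text": "example classifier input ...",
    "label": "harassment",
    "provenance": {
        "generator_model": "example-model-v1",            # which model produced the row
        "prompt_hash": hashlib.sha256(b"prompt template v3").hexdigest(),
        "web_search_used": True,                           # whether live search informed the row
        "source_urls": ["https://example.com/article"],    # citations when search was used
        "generated_at": datetime.now(timezone.utc).isoformat(),
    },
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")
```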

submitted by /u/Effective_Attempt_72
[link] [comments]

STM32H7 Fatigue Detection: 1M Rows → 85k Rows, 512KB RAM, <100ms Inference — Is 4Hz Resampling The Right Move?

Building a real-time fatigue detection system for STM32H7 deployment.

Constraints:

  • 512KB RAM
  • <100ms inference
  • preprocessing on laptop
  • inference on-device only

Dataset:
~1M rows from asynchronous wearable sensors.

Sensors and native sampling rates:

  • ACC: 32 Hz (wrist accelerometer)
  • EDA: 4 Hz (electrodermal activity)
  • Temp: 4 Hz (skin temperature)
  • HR: 1 Hz (heart rate)
  • Breathing: 1 Hz (respiration)
  • IBI: ~0.59 Hz, irregular (inter-beat interval)

Labels:

  • fatigue
  • activity
  • baseline

Current preprocessing strategy:
Resample everything to 4Hz.

Per-signal strategy:

  • ACC: 32 Hz → 4 Hz, mean over 8 samples
  • EDA/Temp: native 4 Hz
  • HR: 1 Hz → 4 Hz, linear interpolation
  • Breathing: 1 Hz → 4 Hz, linear interpolation
  • IBI: ~0.59 Hz → 4 Hz, forward-fill

Result:
~1M rows → ~85k synchronized rows.

Current doubts:

  1. ACC to 4 Hz: using only the mean feels too lossy. Should I also include, per 250 ms window (see the sketch after this list):
  • std
  • max/min
  • magnitude
  • energy

  2. IBI: forward-fill feels mathematically dirty for HRV-related information. Would it be better to:
  • keep IBI irregular
  • compute RMSSD/SDNN at native timing
  • feed only HRV features downstream?

  3. HR/Breathing: does interpolating 1 Hz → 4 Hz introduce fake temporal resolution? Would keeping them at 1 Hz be cleaner?
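For reference, a minimal numpy sketch of both ideas from points 1 and 2: richer 250 ms ACC window features, and RMSSD/SDNN computed at the IBI's native irregular timing. Array names and shapes are assumptions, not the actual pipeline:

```python
import numpy as np

def acc_window_features(acc: np.ndarray, fs: int = 32, win_s: float = 0.25) -> np.ndarray:
    """Summarize each 250 ms window of a (N, 3) accelerometer array beyond the mean."""
    n = int(fs * win_s)                              # 8 samples per window at 32 Hz
    usable = (len(acc) // n) * n
    windows = acc[:usable].reshape(-1, n, acc.shape[1])
    mag = np.linalg.norm(windows, axis=2)            # per-sample vector magnitude
    return np.column_stack([
        windows.mean(axis=1),                        # mean per axis
        windows.std(axis=1),                         # std per axis
        mag.max(axis=1), mag.min(axis=1),            # magnitude range
        (mag ** 2).mean(axis=1),                     # energy per window
    ])

def hrv_features(ibi_ms: np.ndarray) -> tuple[float, float]:
    """RMSSD / SDNN straight from the irregular IBI series, no resampling or forward-fill."""
    diffs = np.diff(ibi_ms)
    rmssd = float(np.sqrt(np.mean(diffs ** 2)))
    sdnn = float(np.std(ibi_ms, ddof=1))
    return rmssd, sdnn
```

If only RMSSD/SDNN-style summaries are fed downstream, the IBI stream never has to be dragged onto the 4 Hz grid at all.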

Considering switching to a multi-rate pipeline:

  • ACC: 8 Hz
  • EDA/Temp: 4 Hz
  • HR/IBI/Breathing: 1 Hz

Question:
For embedded ML / TinyML deployment, is multi-rate worth the added pipeline complexity, or is synchronized 4Hz generally the better engineering tradeoff?

Would appreciate advice from anyone working with:

  • wearable signals
  • HRV
  • TinyML
  • embedded inference
  • multimodal physiological data

submitted by /u/Aziz_2002
[link] [comments]

20k Reddit Crypto Sentiment Dataset With Bitcoin Market Labels

I recently created my first public dataset focused on cryptocurrency sentiment analysis and Bitcoin market forecasting. The dataset contains around 20,000 Reddit posts collected from major crypto communities between 2017 and 2025 using the PRAW API.

It includes:

  • Reddit post metadata
  • Cleaned text features
  • Crypto-enhanced VADER sentiment
  • Custom FinBERT sentiment scores
  • Bitcoin prices and returns
  • Binary BTC movement labels for 1h, 6h, 12h, and 24h horizons

The dataset was built for financial NLP, sentiment analysis, and forecasting research. I am still learning dataset engineering and would appreciate feedback, suggestions, or ideas for improvement.
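In case it helps others reason about the labels, this is roughly how binary movement labels for a given horizon are usually derived from an hourly close series (a sketch with placeholder file and column names, not necessarily how this dataset was built):

```python
import pandas as pd

# Placeholder names: an hourly BTC close series indexed by timestamp.
btc = pd.read_csv("btc_hourly.csv", parse_dates=["timestamp"], index_col="timestamp")

for h in (1, 6, 12, 24):
    future = btc["close"].shift(-h)                    # close h hours ahead
    label = (future > btc["close"]).astype(float)      # 1.0 = up, 0.0 = down/flat
    label[future.isna()] = float("nan")                # no future price near the series end
    btc[f"btc_up_{h}h"] = label
```

One thing worth documenting either way is how posts are aligned to the price series (post timestamp vs. the next full hour), since that choice quietly decides whether labels leak future information.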

submitted by /u/Cyclo_Studios
[link] [comments]

[self-promotion] Free 20-record Samples (CSV + JSON) Of 20 Dev/AI Datasets — Npm, MCP Servers, HuggingFace Models, Homebrew, Etc.

Hi r/datasets — disclosure first: I sell a paid version of these on Gumroad ($34, 83% off launch). I’m posting the free 20-record samples here because they’re genuinely useful on their own and the mod rules ask self-promotion to be labeled.

What’s in the free samples:

20 niche datasets, each with 20 fully-enriched records as CSV + JSON. ~55,000 records total in the paid version (54,958 as of today). Topics:

  • ai-tools, ai-agents, ai-prompts, ai-models-pricing (13 paid Llama 3.3 70B providers compared)
  • public-apis, mcp-servers (2,971), developer-tools, vscode-extensions
  • self-hosted-software, open-source-alternatives, no-code-lowcode
  • design-resources, cybersecurity-tools
  • npm-packages (top by weekly downloads), homebrew-formulae
  • huggingface-models (top 4,000 by downloads), huggingface-datasets (2,600+)
  • vector-db / RAG ecosystem, ai-agent-frameworks (1,324 records — grew 6.6x in 8 days)

Why I built them:

I kept needing structured, queryable lists of “all the X tools” for filterable directory builds. Awesome-lists and READMEs are great for browsing but useless for jq / SQL / search infrastructure. So I curate, normalize, validate (zero invalid records), enrich with stars/downloads/installs, and refresh.

Per-record fields are typed — categorizationTier rates each record 87-100% specific (vs vague “tool” labels). Open question for the sub: how do you handle tier-of-specificity in your own dataset categorization work? My current rubric is per-dataset config-driven but I’m curious what others do.

Free samples (CSV + JSON, MIT-style permissive): https://github.com/futdevpro/niche-datasets-free

Includes mega-sample.json (5 random records from each of the 20 datasets, 100 records total).
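A quick way to sanity-check the samples once downloaded, with the caveat that the field names and top-level shape assumed below are guesses rather than the published schema:

```python
import json
from collections import Counter

with open("mega-sample.json", encoding="utf-8") as f:
    data = json.load(f)

# Handle either a flat list of records or a dict keyed by dataset name (assumption).
records = data if isinstance(data, list) else [r for v in data.values() for r in v]

print(len(records), "records")
print(Counter(r.get("dataset", "unknown") for r in records))   # guessed field name
print(json.dumps(records[0], indent=2)[:500])                  # peek at the enriched fields
```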

Paid version on Gumroad — $34 launch price (83% off $198 list), monthly refresh on AI Models Pricing because OpenRouter changes weekly, quarterly on the rest. Linked from the GitHub README if anyone wants the full thing.

Happy to answer questions about the catalog, methodology, or specific datasets.

submitted by /u/Jhonny_Ronnie
[link] [comments]

How To Apply Normalization To Cross-Sectional Time Series Data?

I am unable to convince myself of any one method.
Some methods I have thought of:

  1. Normalize the full training data of one subject across all features. This introduces a kind of lookahead bias and also loses some information that could have been valuable. And when I want to use one model (say, regression with gradient descent) on the subjects combined, I can't judge whether this will be a good method.
  2. A bad method would be to ignore subjects entirely and just normalize each feature across the full data, but this just feels wrong to me.
  3. I was reading about cross-sectional normalization, which ranks the subjects and normalizes the ranks, but I am unsure how that would be useful.
  4. Another way I found is to use a rolling window, normalizing not over the full data but over the past window. This seems better, but it raises the question of which window length to choose, among other things (see the sketch after this list).
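For option 4 specifically, a per-subject rolling z-score that only uses strictly past observations avoids the lookahead bias from option 1 while still respecting subject boundaries. A sketch with placeholder column names and an arbitrary window length:

```python
import pandas as pd

# Placeholder long-format frame: one row per (subject, timestamp) with feature columns.
df = pd.read_csv("panel.csv", parse_dates=["timestamp"]).sort_values(["subject", "timestamp"])
feature_cols = ["feat_a", "feat_b"]                      # placeholders

def rolling_zscore(g: pd.DataFrame, cols, window=200, min_periods=30) -> pd.DataFrame:
    past = g[cols].shift(1)                              # stats at time t use only data before t
    mu = past.rolling(window, min_periods=min_periods).mean()
    sd = past.rolling(window, min_periods=min_periods).std()
    return (g[cols] - mu) / sd

df[feature_cols] = (
    df.groupby("subject", group_keys=False)
      .apply(lambda g: rolling_zscore(g, feature_cols))
)
```

The window length then becomes an explicit hyperparameter to tune (or to set from domain knowledge), which is at least an honest version of the problem rather than a hidden one.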

And the bigger problem over all of these is the time series . I would lose quite a lot of information when i don’t consider these. ( although not all features have a big factor of this).

submitted by /u/Virtual-Current6295
[link] [comments]

[self-promotion] I Scraped ~70k Geopolitical Risk Events From Public Feeds. Only About A Quarter Made The News. (Parquet + CSV On HF/Kaggle)

I’ve been building an open dataset of geopolitical and supply chain risk events scraped from public feeds (GDELT, ACLED, GDACS, NASA FIRMS, WHO DON) for the past few months. Around 70k events at this point. The thing that surprised me when I cross-checked against mainstream news coverage: only about a quarter of those events have any major-outlet article attached.

The other ~72% are silent. Flagged in at least one public feed but never picked up by major news. I’d assumed those would all be low-severity noise (small protests, minor weather flags, single-source rumors). They’re not. Roughly a quarter of the silent set is still rated critical or high severity by the source feed itself, which works out to ~14k events nobody covered. ACLED specifically dominates the silent set — local conflict events that don’t make English-language outlets.

The cross-check has obvious limits worth flagging up front: my “news coverage” is a Google News fetch (so paywalled or non-English coverage gets undercounted), and the severity is graded after the fact by an LLM step (so it can take the wrong angle on ambiguous events). Both are best-effort. But the headline gap — ~28% news overlap — is just a SQL join, not LLM-dependent. Events are geocoded by region, no PII. Actor names from ACLED are excluded per their license.
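For anyone wanting to reproduce the overlap number, the core of it is a membership check between the event table and the news-match table. A pandas sketch with assumed table and column names (the real schema may differ):

```python
import pandas as pd

events = pd.read_parquet("events.parquet")        # assumed columns: event_id, severity, region, ...
news = pd.read_parquet("news_matches.parquet")    # assumed columns: event_id, article_url, ...

covered = events["event_id"].isin(news["event_id"])
print(f"news overlap: {covered.mean():.1%}")

silent = events[~covered]
critical_silent = silent[silent["severity"].isin(["critical", "high"])]
print(len(critical_silent), "high/critical severity events with no matched coverage")
```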

The deduplicated event/chokepoint/entity tables are up on Hugging Face and Kaggle as Parquet + a 10% CSV sample, CC-BY-NC-SA. Browsable map version is at tremorwatch.com if you want to poke at individual events first. Curious if anyone has tried something similar at this scale and how you’d refine the coverage definition (different news source mix, embedding-based fuzzy match, etc).

Disclosure: I built this — part of an early-stage startup (Volt AI). Dataset is free under CC-BY-NC-SA, no paid tier exists yet. Posting under r/datasets self-promo guidelines; happy to adjust format if mods prefer.

submitted by /u/Latter_Panda4439
[link] [comments]

[Dataset] NAICS Contagion Map: Topological Edge Network Mapping 1,100+ Supply Chain Cascades Across 340+ Industries

I’m releasing the NAICS Contagion Map, a dataset designed to bridge the gap between the physical economy (NAICS) and financial market taxonomies (GICS).

The goal was to map how volatility in upstream raw materials (Tier 4) systematically ripples down to consumer-facing products (Tier 1). This is particularly useful for anyone doing economic modelling, supply chain resilience analysis, or ESG/Risk research.

What’s inside the CSVs:

  • 340+ NAICS Nodes: Each assigned a Contagion Score (1.0 – 10.0) based on upstream concentration (HHI) and structural importance.
  • 1,100+ Topological Edges: Mapping the exact flow from Tier 4 (Commodities) -> Tier 3 (Extractors) -> Tier 2 (Processors) -> Tier 1 (Assemblers).
  • NAICS to GICS Bridge: Each node is mapped to its financial sector equivalent.

Methodology: This is a derived dataset. The structural tiers and contagion scores were generated via a deterministic heuristic algorithm I built that analyzes industrial interdependencies. While the raw NAICS data is from Census/GICS registries, the relationship mapping (edges) and risk scoring are my original derivation.
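The post does not publish the exact heuristic, but as an illustration of the ingredients it names (upstream concentration via HHI plus structural importance), a toy version might look like the following. The weights and rescaling are invented for the example and are not the dataset's actual formula:

```python
import pandas as pd

# Assumed edge table: one row per supplier -> buyer link with the supplier's share of inputs.
edges = pd.read_csv("edges.csv")   # columns: supplier_naics, buyer_naics, share_of_buyer_inputs

def upstream_hhi(buyer_naics: str) -> float:
    """Herfindahl-Hirschman index of a node's upstream input shares (0 to 1)."""
    shares = edges.loc[edges["buyer_naics"] == buyer_naics, "share_of_buyer_inputs"]
    return float((shares ** 2).sum())

def contagion_score(buyer_naics: str, structural_importance: float) -> float:
    """Toy blend of concentration and importance, rescaled to a 1.0 to 10.0 range."""
    raw = 0.6 * upstream_hhi(buyer_naics) + 0.4 * structural_importance   # both in 0..1
    return round(1.0 + 9.0 * raw, 1)
```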

Access GitHub Repo

Full disclosure: I am the creator of this project. I’m sharing this as a free open-source intelligence drop for the community to play with. I’d love to get feedback on the edge logic or hear how you’re using the topology in your own models.

submitted by /u/Vast-Village-2596
[link] [comments]

Open Hantavirus Case Dataset – Aggregated From WHO/CDC/ECDC/PAHO/ProMED, CC-BY-SA JSON API

Sharing a dataset I’ve been maintaining since the MV Hondius hantavirus cluster started in early April.

Aggregated from primary public health sources: WHO Disease Outbreak News, CDC HAN advisories, ECDC bulletins, PAHO weekly reports, ProMED-mail, and national health ministries. Cron pulls every 30 minutes, normalizes case definitions per WHO DON600 framework, geocodes to city or province level where source data permits, dedupes against the archive.

Format: JSON
License: CC-BY-SA 4.0
Endpoint: https://hantaosint.com/api/v1/public.json
Dashboard: https://hantaosint.com
Methodology: https://hantaosint.com/methodology

Fields: case_id, date, country, region, virus_strain, confidence_level (confirmed/suspected/probable/monitoring), source, source_url, lat, lng

Confidence levels are kept separate rather than conflated, which most outbreak trackers don’t bother with. Historical outbreaks included for retrospective analysis: 1993 Four Corners, 2012 Yosemite, 2018-19 Epuyen.
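Pulling the feed and splitting on those confidence levels is straightforward; a sketch assuming the endpoint returns a JSON array of case records with the fields listed above (the wrapper handling may need adjusting to the real response shape):

```python
import requests

resp = requests.get("https://hantaosint.com/api/v1/public.json", timeout=30)
resp.raise_for_status()
data = resp.json()

# Assumption: either a bare list of cases or an object with a "cases" key.
cases = data if isinstance(data, list) else data.get("cases", [])

confirmed = [c for c in cases if c.get("confidence_level") == "confirmed"]
print(len(confirmed), "confirmed cases of", len(cases), "total")
```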

Use cases I built it for: time-series modeling of cluster spread, retrospective comparison of hantavirus outbreaks, surveillance signal for travel medicine research.

Happy to add fields if researchers need additional structure. Open to feedback on the schema and source coverage.

submitted by /u/Professional_Art2346
[link] [comments]

US Capital Punishment (1999): A Curated Dataset Of Judicial Executions For Criminology And Data Science – Zenodo

https://zenodo.org/records/20130055
This curated dataset provides a comprehensive and high-fidelity record of the judicial executions carried out in the United States during the year 1999, which represents the historical peak of capital punishment activity in the modern era.
https://github.com/lightbluetitan/us-capital-punishment-1999

submitted by /u/renzocrossi
[link] [comments]

What Publicly Available Recurring Data Source Do You Repeatedly Search For That Still Doesn’t Exist In Clean Structured Format?

I’m researching gaps in publicly available recurring data that people regularly need for analytics, ML, automation, monitoring, or business workflows.

I’m especially interested in data that is technically public but still difficult to use because it is:

  • trapped in PDFs
  • scattered across websites
  • updated inconsistently
  • available only through dashboards
  • difficult to scrape
  • missing historical archives
  • lacking APIs
  • poorly standardized

Examples could include:

  • government notices
  • procurement/tender data
  • financial filings
  • real-estate listings
  • agriculture pricing
  • shipping/logistics updates
  • business registries
  • market prices
  • legal/regulatory documents
  • municipality/city data

submitted by /u/strange1807
[link] [comments]

What Kind Of Robot Manipulation Datasets Are Teams Actually Looking For Right Now?

I’m trying to understand what robotics and embodied AI teams actually need when collecting real-world training data.

The use cases I keep hearing about are:

  • robotic hand manipulation
  • grasping and pick-and-place
  • soft and fragile object handling
  • tabletop tasks
  • warehouse tasks

For teams working on imitation learning, VLA models, or robot manipulation, what is usually the biggest bottleneck?

  • not enough real-world data
  • task diversity
  • camera and sensor consistency
  • annotation quality
  • hardware-specific data

I work with a small team connected to robotic visual data collection, but I’m mainly trying to understand what teams actually need before going too deep in the wrong direction.

submitted by /u/WideAmbition1964
[link] [comments]

Tool For Data Ingestion, Transformation, Orchestrations, And Analysis [self-promotion]

Disclaimer: I'm a developer advocate at Bruin. I previously worked in data analyst and then data engineering roles for almost 10 years, and now in this job I finally have the freedom to play around with data just for fun. This community has always been my go-to place to find cool datasets.

That’s why I’m excited to share this announcement with you but I promise to keep the promotional talk very minimal.

I’m sure many of you use AI agents to analyze data, build dashboards, and share them with friends and others. Bruin has a lot of open-source tools for data ingestion, transformation, orchestration, and visualization. Today we are announcing the general availability of Bruin Cloud which is the managed service of those free open-source tools.

I'm personally excited because, as a dev advocate, I've focused mainly on our open-source tools, but managing and deploying them locally is sometimes an obstacle for someone who just wants to play around with data. The free tier (no payment required) of Bruin Cloud gives you enough credits to get started running your pipelines and, more importantly, analyzing your data with the AI data analyst and dashboard builder.

Check out the open-source tools: https://github.com/bruin-data

If interested, feel free to check Bruin Cloud too: https://cloud.getbruin.com/register

submitted by /u/uncertainschrodinger
[link] [comments]

Looking For A Synthetic Business Datawarehouse That Keeps Getting Updates

Basically title.

For context: I am building a startup, and for demo purposes we want to set up a new demo tenant with fake business data. The closest thing I've found is the Microsoft Contoso dataset, but like many other options, it's just a dataset, not a hosted data warehouse that keeps getting (preferably daily) updates.

Ideally I'd just plug in with SQL DB credentials and go to town with a read-only user.

Does anyone know if something like this exists?

submitted by /u/Alert-Track-8277
[link] [comments]

B2B SaaS Account Health Dataset – Synthetic But Realistic B2B SaaS Dataset Modeled After Platforms Like Datadog, HubSpot, And Amplitude. 50,000 Customer Accounts With 18 Features Covering Product Engagement, Billing, And Support Metrics.

https://www.kaggle.com/datasets/akshankrithick/b2b-saas-account-health-dataset

Synthetic but realistic B2B SaaS dataset modeled after platforms like Datadog, HubSpot, and Amplitude. 50,000 customer accounts with 18 features covering product engagement, billing, and support metrics.

Three Prediction Tasks

  1. Churn prediction (binary): Will this account cancel within 90 days? (~9.5% churn rate)
  2. Revenue prediction (regression): What is this account’s next-month revenue?
  3. Health segmentation (multiclass): Thriving / Stable / At Risk / Critical
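A rough baseline for task 1, with the caveat that the filename and target column below are guesses and should be checked against the Kaggle data dictionary:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("b2b_saas_account_health.csv")            # placeholder filename
target = "churned_90d"                                      # guessed target column name

# Drop any identifier columns before training, then one-hot encode categoricals.
X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
y = df[target]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
model = HistGradientBoostingClassifier().fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

Given the roughly 9.5% churn rate, reporting PR-AUC or recall at a fixed precision alongside ROC AUC would probably say more about practical usefulness.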

submitted by /u/devilwithin305
[link] [comments]

I’ve Been Recording My Poops For Ten Years

I have Ulcerative Colitis and a nerd brain. I’ve been tracking my bowel movements for 10 years. I built myself a little dashboard to log every stool. So I have date, time, Bristol stool type, urgency, and any blood present (because UC).

Maybe I'm not the first person to do this, but if I am, the data might be of some use?

Does anyone have any suggestions about what I could do with the data? Any kind of value for researchers?

I did skip tracking for about two years in the middle, so it's really about 8 years' worth of data going back 10 years.

submitted by /u/robertShippey
[link] [comments]

[self-promotion] Pentagon’s Declassified UFO Release Parsed To Markdown And JSON

Yesterday out of nowhere, the Pentagon released the first tranche of the PURSUE archive which has hundreds of declassified UAP records going back to the 1940s, sourced from the FBI, NASA, etc. The raw release is mostly scanned PDFs with no text layer, which makes the whole thing essentially un-queryable without an OCR pass.

Ran it through a parser to produce clean markdown and JSON. Free to download: https://pursue-release-01-parsed.vercel.app/

And I threw in a map and search for fun, because why not. The data is from the Dept of War (unaffiliated), the parsing came from Extend (I work there), and the site was generated with Codex (unaffiliated).

submitted by /u/tuberreact
[link] [comments]

National Public Database Leak Download

Hello,

Does anyone know how to download, or have a link for, the full National Public Database leak? I tried searching extensively on the clearnet and dark web, but I can't find anything other than two old GitHub repos with broken download links. I just want to explore the database and do some data analysis on it, nothing bad 🙂

Any help would be greatly appreciated!

submitted by /u/avrageliminaluser
[link] [comments]