Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Looking For Bloomberg ESG Disclosure Scores For ~1,500 EU Listed Firms (2014-2023) – Bachelor Thesis

Hey everyone,

I’m a bachelor student at Erasmus University Rotterdam working on my thesis about CEO tenure and ESG disclosure quality in EU firms.

I need the Bloomberg ESG Disclosure Score for approximately 1,500 listed EU companies across the Energy, Materials, Industrials and Utilities sectors, covering the years 2014-2023.

Unfortunately our university only has access to LSEG/Refinitiv which doesn’t include this specific metric.

If you have access to a Bloomberg Terminal and would be willing to help, I would need:

  • ESG Disclosure Score per firm per year (2014-2023)
  • For ~1,500 companies (I have the full ISIN list ready)
  • Output as a simple Excel file

Happy to share our full company list and explain exactly what’s needed. This would make a huge difference for our research.

DMs open – any help is massively appreciated!

submitted by /u/DaanB2707

Oblique Imagery Data / Real Estate Aerial Imagery

Hey everyone, I’m working on sourcing SB 721 leads across Southern California, specifically trying to identify multifamily buildings with exterior elevated elements (EEEs) like balconies, exterior walkways, and deck structures. The problem I’m running into is that to properly pre-qualify these buildings visually before burning skip trace credits, I really need oblique imagery: the angled aerial photography that actually shows you the side of a building rather than just the rooftop.

Platforms like Nearmap and Pictometry are the gold standard for this, but the licensing cost for regional coverage across LA, Orange, Ventura, and San Bernardino counties is running $10,000–$25,000, which doesn’t make sense for a lead generation use case. I’ve already tried Google Street View and Google Maps 45° imagery, and coverage is way too patchy, especially on the secondary and tertiary streets where most of the 3–8 unit wood-frame stock from the 1960s–80s actually sits, which is exactly the inventory I’m targeting.

The core problem is that county assessor data and property APIs can confirm unit count and ownership, but nothing in my current stack can tell me whether a building actually has qualifying EEEs without someone physically driving by or paying for imagery I can’t justify at this stage.

Does anyone know of alternatives, whether that’s a lower-cost oblique imagery provider, a per-area-of-interest pricing model, AI tools that can classify building features from whatever imagery is available, or any other creative approach people have used to visually pre-qualify multifamily buildings for EEE identification at scale in SoCal?

Also, long shot, but if anyone has an existing Nearmap or Pictometry subscription they’re not fully utilizing and would be open to sharing access or credentials, I’d love to work something out. Happy to compensate or collaborate. Any direction at all would be really appreciated.

submitted by /u/Prestigious-Tip927

I Am Looking For A Car Color Dataset

I’m looking for a dataset that explores the relationship between car color and driving-related factors or consumer behavior. For example, I’m interested in statistics showing whether certain car colors are associated with higher accident rates, speeding tendencies, insurance claims, resale value, or buyer preferences. Ideally, the dataset would include measurable data on topics such as accident frequency by vehicle color, popularity of specific colors among consumers, or correlations between car color and driver behavior.

submitted by /u/Fireking1021

Most Demanded Domains For Datasets Globally?

I was just looking for the most in-demand dataset domains globally, and found that e-commerce product listings, job listings/salary/skills, and real estate listings (who’s making a model for RE?) are among the top. Have any of you worked with these domains before? What’s your experience with them?

submitted by /u/DeamosV

[Synthetic][PAID][self-promotion] Opinions Wanted On Vision Training Data

I’ve marked this as paid, synthetic, and self-promotion because ultimately I work for a commercial organisation, Synthera. There is a free version, though, which enables you to do exactly what I am sharing here, so I hope this is of some use.

We just released version 26.1 of the tool, which has much better pedestrian routing.

https://vimeo.com/1192312025/c82f863dc1?share=copy&fl=sv&fe=ci

Would love to know what people think.

For information, the setup for creating this content took around 15 minutes, and it then took around an hour to create 2400 fully annotated frames.

submitted by /u/Syrup1971

Trying To Build A Model That Predicts Speed Through Water For Sailboats

Hey, as the title reads, I am currently working on building a model that predicts the speed through water from other parameters that are easier to measure on sailboats. To do this, however, I need a lot of data from actual sailing where things such as speed, wind, and also speed through water have been measured.

Do any of you have any idea how to find data like this? I have searched around online but not really found anything.

Any help is appreciated!

submitted by /u/Dry_Situation2154

Open Source Project Which Constructed A 70:30 Split Dataset (translations:instructions) For Fine-tuning Google’s TranslateGemma For Improved Bidirectional English Welsh Translations!

I constructed a dataset with a 70:30 split of translations to instruction prompts for fine-tuning Google’s translategemma-4b-it, an LLM that specializes in translation tasks. The project is fully open source.

Given my limited GPU budget, I couldn’t expand this to include 100% of the Welsh:English translation datasets, so a different data recipe could substantially improve the fine-tuning training data and the resulting quality of output translations (especially if trained on 12B or 27B next).

What language translation pairs would you want to see fine-tuned into the TranslateGemma models? I was originally thinking of Klingon, but I couldn’t easily find datasets for it on Hugging Face or Kaggle, so I went with Welsh since I found several million rows of data for it.

submitted by /u/ufos1111

What’s Your Actual Day-to-day Stack? (Asking For The Messy, Non-tutorial Version)

I’ve been optimizing data pipelines at a mid-sized company and wow, production is nothing like the clean tutorials online. Most of my week goes into source validation, handling incomplete data, and making sure KPIs actually reflect business goals—not just building ETL for the sake of it.

My core stack is PostgreSQL + dbt + Looker for dashboards, but for quick ad-hoc analysis and monitoring without spinning up heavy infra, I’ve been leaning on Scoop Analytics. It’s cut down a ton of manual upkeep and lets me focus on data quality instead of firefighting.

What’s your real-world combo? Do you prefer keeping everything open-source, or are you cool with SaaS tools that give you time back? Genuinely curious about what works (and what doesn’t) in the wild.

submitted by /u/Extra-Tap-8050

I Want To Create A Website About The History Of Club Atlético Independiente (21st Century): How Do I Get My Data From Excel Onto A Website?

Hi, I have a project in which I would like to build a website about the history of Independiente (ideally its whole history, but for now the entire 21st century). As an example, Lanús has one that is very good. It is called museogranate.clublanus.

I would also like to add every match and the lineups for each match, plus all the information available for that match (lineups for Independiente and for the opposing team, yellow cards, red cards, goals, assists, and substitutions).

As an extra, I was also planning to build a standings table for each tournament of the 21st century, and to be able to see how the table stood on a given matchday. For example, I want to see the Apertura 2010 standings as of matchday 9. It would also show all the matches that were played, and the goals with their respective minutes.

I have all of this recorded in an Excel file, but I don’t know how to get it onto a website. I don’t have the programming skills needed, but I can learn. What do you recommend?
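One possible starting point: once the Excel sheet is exported to rows, the standings for any matchday are a small aggregation. This is a minimal Python sketch, not a full site. The column order, the placeholder teams and scores, the 3-points-per-win rule, and the tie-break (goal difference, then goals scored) are all assumptions; the real tournament rules may differ.

```python
from collections import defaultdict

# Hypothetical rows, as they might look after exporting the Excel sheet:
# (tournament, matchday, home, away, home_goals, away_goals).
# Teams and scores are made-up placeholders, not real results.
MATCHES = [
    ("Apertura 2010", 1, "Team A", "Team B", 2, 0),
    ("Apertura 2010", 1, "Team C", "Team D", 1, 1),
    ("Apertura 2010", 2, "Team B", "Team C", 3, 1),
    ("Apertura 2010", 2, "Team D", "Team A", 0, 0),
    ("Apertura 2010", 3, "Team A", "Team C", 0, 1),
]


def standings(matches, tournament, up_to_matchday):
    """Table for one tournament, counting only matches up to a matchday."""
    table = defaultdict(lambda: {"played": 0, "points": 0, "gf": 0, "ga": 0})
    for tourn, day, home, away, hg, ag in matches:
        if tourn != tournament or day > up_to_matchday:
            continue
        # Update both sides of the match from their own point of view.
        for team, scored, conceded in ((home, hg, ag), (away, ag, hg)):
            row = table[team]
            row["played"] += 1
            row["gf"] += scored
            row["ga"] += conceded
            if scored > conceded:
                row["points"] += 3  # assumed: 3 points per win
            elif scored == conceded:
                row["points"] += 1  # assumed: 1 point per draw
    # Assumed tie-break: points, then goal difference, then goals scored.
    return sorted(
        table.items(),
        key=lambda kv: (kv[1]["points"], kv[1]["gf"] - kv[1]["ga"], kv[1]["gf"]),
        reverse=True,
    )


for team, row in standings(MATCHES, "Apertura 2010", up_to_matchday=2):
    print(team, row["played"], row["points"], row["gf"], row["ga"])
```

The same function can back a web page later: the site only needs to pass a tournament name and a matchday and render the returned rows.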

submitted by /u/Few-Replacement-6351

[Synthetic][PAID][self-promotion] Made-to-order Training Data Generator With Web Search And Exports

Disclosure: I’m on the Abliteration team.

We just shipped a training-data generator for people who need specific examples rather than another generic public dataset.

You describe the examples you want and it generates structured synthetic data. If the dataset needs current or real-world facts, you can turn on web search. Exports are live for Hugging Face, Kaggle, S3, and OpenAI.

The first use cases we built around are classifier and eval datasets for trust and safety: grooming detection, harassment detection, security research evals, jailbreak and edge-case sets, and similar work where teams need examples that general-purpose models often refuse to generate.

I marked this as synthetic and paid because the outputs are generated and this is a commercial tool.

Product: https://abliteration.ai/

Synthetic data page: https://abliteration.ai/use-cases/synthetic-data

Launch video: https://x.com/abliteration_ai/status/2054675554138194178

For people who curate datasets: what export format or per-row provenance metadata do you usually need before a generated dataset is usable?

submitted by /u/Effective_Attempt_72

STM32H7 Fatigue Detection: 1M Rows → 85k Rows, 512KB RAM, <100ms Inference — Is 4Hz Resampling The Right Move?

Building a real-time fatigue detection system for STM32H7 deployment.

Constraints:

  • 512KB RAM
  • <100ms inference
  • preprocessing on laptop
  • inference on-device only

Dataset:
~1M rows from asynchronous wearable sensors.

Sensor | Native Frequency | Notes
ACC | 32 Hz | wrist accelerometer
EDA | 4 Hz | electrodermal activity
Temp | 4 Hz | skin temperature
HR | 1 Hz | heart rate
Breathing | 1 Hz | respiration
IBI | ~0.59 Hz | irregular inter-beat interval

Labels:

  • fatigue
  • activity
  • baseline

Current preprocessing strategy:
Resample everything to 4Hz.

Signal | Strategy
ACC | 32→4Hz, mean over 8 samples
EDA/Temp | native 4Hz
HR | 1→4Hz, linear interpolation
Breathing | 1→4Hz, linear interpolation
IBI | ~0.59→4Hz, forward-fill

Result:
~1M rows → ~85k synchronized rows.
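For reference, the three resampling operations in that strategy can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, timestamps are assumed to be in seconds, and the real pipeline may well use a library resampler instead.

```python
def block_mean(samples, factor):
    """Downsample by averaging non-overlapping blocks (32 Hz -> 4 Hz: factor=8)."""
    usable = len(samples) - len(samples) % factor  # drop a trailing partial block
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


def upsample_linear(samples, factor):
    """Upsample by linear interpolation between neighbours (1 Hz -> 4 Hz: factor=4)."""
    if not samples:
        return []
    out = []
    for a, b in zip(samples, samples[1:]):
        out.extend(a + (b - a) * k / factor for k in range(factor))
    out.append(samples[-1])  # keep the final original sample
    return out


def forward_fill(event_times, event_values, grid_times):
    """Hold the most recent irregular value (e.g. IBI) at each grid timestamp."""
    out, j, last = [], 0, None
    for t in grid_times:
        # Consume every event that happened at or before this grid time.
        while j < len(event_times) and event_times[j] <= t:
            last = event_values[j]
            j += 1
        out.append(last)  # stays None until the first event arrives
    return out


acc_4hz = block_mean([float(v) for v in range(16)], 8)
hr_4hz = upsample_linear([60.0, 64.0, 68.0], 4)
ibi_4hz = forward_fill([0.3, 1.9], [0.8, 0.9], [0.0, 0.25, 0.5, 2.0])
```

Note that the interpolated HR values carry no information beyond the original 1 Hz samples, and the forward-filled IBI repeats each interval several times.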

Current doubts:

  1. ACC to 4Hz: Using only the mean feels too lossy. Should I also include the following per 250ms window?
     • std
     • max/min
     • magnitude
     • energy
  2. IBI: Forward-fill feels mathematically dirty for HRV-related information. Would it be better to:
     • keep IBI irregular
     • compute RMSSD/SDNN at native timing
     • feed only HRV features downstream?
  3. HR/Breathing: Does interpolating 1Hz → 4Hz introduce fake temporal resolution? Would keeping them at 1Hz be cleaner?
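The two feature ideas in the first two doubts can be sketched as follows. This is a hedged illustration, not a recommendation: the function names are hypothetical, "energy" is taken to mean the mean squared magnitude (one of several possible definitions), IBI values are assumed to be in milliseconds, and the analysis window for the HRV features is left to the caller.

```python
import math
import statistics


def acc_window_features(xyz_window):
    """Summarize one window of (x, y, z) samples via the magnitude signal."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in xyz_window]
    return {
        "mean": statistics.fmean(mags),
        "std": statistics.pstdev(mags),
        "min": min(mags),
        "max": max(mags),
        "energy": sum(m * m for m in mags) / len(mags),  # assumed definition
    }


def hrv_features(ibi_ms):
    """RMSSD and SDNN from consecutive inter-beat intervals (milliseconds)."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    # RMSSD: root mean square of successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # SDNN: sample standard deviation of the intervals themselves.
    sdnn = statistics.stdev(ibi_ms)
    return {"rmssd": rmssd, "sdnn": sdnn}


# At 32 Hz, a 250 ms window holds 8 accelerometer samples.
window = [(0.0, 0.0, 1.0)] * 4 + [(0.0, 0.6, 0.8)] * 4
acc_feats = acc_window_features(window)
hrv_feats = hrv_features([800, 810, 790, 800])
```

Computing the HRV features directly on the irregular intervals avoids forward-filling IBI at all; only the resulting feature values would then need to be aligned with the other signals.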

Considering switching to a multi-rate pipeline:

Signal Group | Frequency
ACC | 8 Hz
EDA/Temp | 4 Hz
HR/IBI/Breathing | 1 Hz

Question:
For embedded ML / TinyML deployment, is multi-rate worth the added pipeline complexity, or is synchronized 4Hz generally the better engineering tradeoff?

Would appreciate advice from anyone working with:

  • wearable signals
  • HRV
  • TinyML
  • embedded inference
  • multimodal physiological data

submitted by /u/Aziz_2002

20k Reddit Crypto Sentiment Dataset With Bitcoin Market Labels

I recently created my first public dataset focused on cryptocurrency sentiment analysis and Bitcoin market forecasting. The dataset contains around 20,000 Reddit posts collected from major crypto communities between 2017 and 2025 using the PRAW API.

It includes:

  • Reddit post metadata
  • Cleaned text features
  • Crypto-enhanced VADER sentiment
  • Custom FinBERT sentiment scores
  • Bitcoin prices and returns
  • Binary BTC movement labels for 1h, 6h, 12h, and 24h horizons

The dataset was built for financial NLP, sentiment analysis, and forecasting research. I am still learning dataset engineering and would appreciate feedback, suggestions, or ideas for improvement.
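For readers unfamiliar with horizon labels, here is a minimal sketch of how binary movement labels for several horizons could be derived from a price series. The post does not state the actual labeling rule, so everything here is an assumption: timestamps in seconds, "first price at or after the target time" as the lookup, and label 1 meaning strictly higher than the price at post time.

```python
from bisect import bisect_left


def price_at_or_after(times, prices, t):
    """First price at or after timestamp t (seconds); None if past the end."""
    i = bisect_left(times, t)
    return prices[i] if i < len(times) else None


def movement_labels(post_time, times, prices, horizons_h=(1, 6, 12, 24)):
    """Label 1 if the price is higher after each horizon than at post time."""
    base = price_at_or_after(times, prices, post_time)
    labels = {}
    for h in horizons_h:
        future = price_at_or_after(times, prices, post_time + h * 3600)
        if base is None or future is None:
            labels[f"up_{h}h"] = None  # horizon runs past the available prices
        else:
            labels[f"up_{h}h"] = int(future > base)
    return labels
```

Whatever rule is used, documenting it (including how ties and missing prices are handled) makes the labels much easier to reuse.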

submitted by /u/Cyclo_Studios

[self-promotion] Free 20-record Samples (CSV + JSON) Of 20 Dev/AI Datasets — Npm, MCP Servers, HuggingFace Models, Homebrew, Etc.

Hi r/datasets — disclosure first: I sell a paid version of these on Gumroad ($34, 83% off launch). I’m posting the free 20-record samples here because they’re genuinely useful on their own and the mod rules ask self-promotion to be labeled.

What’s in the free samples:

20 niche datasets, each with 20 fully-enriched records as CSV + JSON. ~55,000 records total in the paid version (54,958 as of today). Topics:

  • ai-tools, ai-agents, ai-prompts, ai-models-pricing (13 paid Llama 3.3 70B providers compared)
  • public-apis, mcp-servers (2,971), developer-tools, vscode-extensions
  • self-hosted-software, open-source-alternatives, no-code-lowcode
  • design-resources, cybersecurity-tools
  • npm-packages (top by weekly downloads), homebrew-formulae
  • huggingface-models (top 4,000 by downloads), huggingface-datasets (2,600+)
  • vector-db / RAG ecosystem, ai-agent-frameworks (1,324 records — grew 6.6x in 8 days)

Why I built them:

I kept needing structured, queryable lists of “all the X tools” for filterable directory builds. Awesome-lists and READMEs are great for browsing but useless for jq / SQL / search infrastructure. So I curate, normalize, validate (zero invalid records), enrich with stars/downloads/installs, and refresh.

Per-record fields are typed — categorizationTier rates each record 87-100% specific (vs vague “tool” labels). Open question for the sub: how do you handle tier-of-specificity in your own dataset categorization work? My current rubric is per-dataset config-driven but I’m curious what others do.

Free samples (CSV + JSON, MIT-style permissive): https://github.com/futdevpro/niche-datasets-free

Includes mega-sample.json (5 random records from each of the 20 datasets, 100 records total).

Paid version on Gumroad — $34 launch price (83% off $198 list), monthly refresh on AI Models Pricing because OpenRouter changes weekly, quarterly on the rest. Linked from the GitHub README if anyone wants the full thing.

Happy to answer questions about the catalog, methodology, or specific datasets.

submitted by /u/Jhonny_Ronnie

How To Apply Normalization For Cross Sectional Time Series Data?

I can’t convince myself to settle on one method.
Some methods that I have thought of:

  1. Normalize each subject’s full training data across all features. This introduces some lookahead bias and also loses information that could have been valuable. And if I want to use one model (say, regression with gradient descent) for all subjects combined, I can’t judge whether this is a good method.
  2. Ignore the subjects and normalize each feature across the full data set. This seems like a bad method and just feels wrong to me.
  3. Cross-sectional normalization, which I was reading about: it ranks the subjects and applies some kind of normalization. But I am unsure how that would be useful.
  4. A rolling window, where I normalize not over the full data but only over the past window. This seems better, but it raises a lot of questions, starting with what window length to choose.

The bigger problem across all of these is the time-series structure. I would lose quite a lot of information if I ignore it (although not every feature has a strong time dependence).
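To make the rolling-window option concrete, here is a minimal sketch of a per-subject z-score that uses only past values, so no lookahead is introduced. The function name, the (subject, value) row layout, and the rule of returning None until two past values exist are assumptions for illustration; the window length remains a choice to tune.

```python
import statistics
from collections import defaultdict, deque


def rolling_zscore(rows, window):
    """Z-score each value against the previous `window` values of its subject.

    rows: iterable of (subject, value) pairs in chronological order.
    Returns one score per row; None until the subject has 2 past values.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for subject, value in rows:
        past = history[subject]
        if len(past) >= 2:
            mean = statistics.fmean(past)
            std = statistics.pstdev(past)
            out.append((value - mean) / std if std > 0 else 0.0)
        else:
            out.append(None)
        # Update history only after scoring, so the current value
        # never influences its own normalization.
        past.append(value)
    return out


rows = [("a", 1.0), ("a", 3.0), ("a", 5.0), ("b", 10.0), ("a", 7.0)]
scores = rolling_zscore(rows, window=2)
```

Because each subject keeps its own history, the scores are comparable across subjects, which is one way to feed a single combined model.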

submitted by /u/Virtual-Current6295

[self-promotion] I Scraped ~70k Geopolitical Risk Events From Public Feeds. Only About A Quarter Made The News. (Parquet + CSV On HF/Kaggle)

I’ve been building an open dataset of geopolitical and supply chain risk events scraped from public feeds (GDELT, ACLED, GDACS, NASA FIRMS, WHO DON) for the past few months. Around 70k events at this point. The thing that surprised me when I cross-checked against mainstream news coverage: only about a quarter of those events have any major-outlet article attached.

The other ~72% are silent. Flagged in at least one public feed but never picked up by major news. I’d assumed those would all be low-severity noise (small protests, minor weather flags, single-source rumors). They’re not. Roughly a quarter of the silent set is still rated critical or high severity by the source feed itself, which works out to ~14k events nobody covered. ACLED specifically dominates the silent set — local conflict events that don’t make English-language outlets.

The cross-check has obvious limits worth flagging up front: my “news coverage” is a Google News fetch (so paywalled or non-English coverage gets undercounted), and the severity is graded after the fact by an LLM step (so it can take the wrong angle on ambiguous events). Both are best-effort. But the headline gap, ~28% news overlap, is just a SQL join, not LLM-dependent. Events are geocoded by region, no PII. Actor names from ACLED are excluded per their license.
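As an illustration of what such a join might look like, here is a small sketch using Python’s built-in sqlite3 module. The table names, column names, and rows are hypothetical stand-ins, not the dataset’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and toy rows, for illustration only.
conn.executescript("""
    CREATE TABLE events (event_id TEXT PRIMARY KEY, severity TEXT);
    CREATE TABLE news_matches (event_id TEXT, outlet TEXT);
    INSERT INTO events VALUES
        ('e1', 'high'), ('e2', 'low'), ('e3', 'critical'), ('e4', 'low');
    INSERT INTO news_matches VALUES ('e1', 'Outlet A'), ('e1', 'Outlet B');
""")

# Share of events with at least one matched article.
covered, total = conn.execute("""
    SELECT COUNT(DISTINCT n.event_id), COUNT(DISTINCT e.event_id)
    FROM events e
    LEFT JOIN news_matches n ON n.event_id = e.event_id
""").fetchone()
overlap = covered / total

# "Silent" events that the source feed still rates critical or high.
silent_severe = conn.execute("""
    SELECT COUNT(*)
    FROM events e
    WHERE e.severity IN ('critical', 'high')
      AND NOT EXISTS (
          SELECT 1 FROM news_matches n WHERE n.event_id = e.event_id
      )
""").fetchone()[0]

print(f"overlap={overlap:.0%}, silent severe events={silent_severe}")
```

Counting distinct event IDs keeps an event with several matched articles from being counted more than once.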

The deduplicated event/chokepoint/entity tables are up on Hugging Face and Kaggle as Parquet + a 10% CSV sample, CC-BY-NC-SA. Browsable map version is at tremorwatch.com if you want to poke at individual events first. Curious if anyone has tried something similar at this scale and how you’d refine the coverage definition (different news source mix, embedding-based fuzzy match, etc).

Disclosure: I built this — part of an early-stage startup (Volt AI). Dataset is free under CC-BY-NC-SA, no paid tier exists yet. Posting under r/datasets self-promo guidelines; happy to adjust format if mods prefer.

submitted by /u/Latter_Panda4439

[Dataset] NAICS Contagion Map: Topological Edge Network Mapping 1,100+ Supply Chain Cascades Across 340+ Industries

I’m releasing the NAICS Contagion Map, a dataset designed to bridge the gap between the physical economy (NAICS) and financial market taxonomies (GICS).

The goal was to map how volatility in upstream raw materials (Tier 4) systematically ripples down to consumer-facing products (Tier 1). This is particularly useful for anyone doing economic modelling, supply chain resilience analysis, or ESG/Risk research.

What’s inside the CSVs:

  • 340+ NAICS Nodes: Each assigned a Contagion Score (1.0 – 10.0) based on upstream concentration (HHI) and structural importance.
  • 1,100+ Topological Edges: Mapping the exact flow from Tier 4 (Commodities) -> Tier 3 (Extractors) -> Tier 2 (Processors) -> Tier 1 (Assemblers).
  • NAICS to GICS Bridge: Each node is mapped to its financial sector equivalent.

Methodology: This is a derived dataset. The structural tiers and contagion scores were generated via a deterministic heuristic algorithm I built that analyzes industrial interdependencies. While the raw NAICS data is from Census/GICS registries, the relationship mapping (edges) and risk scoring are my original derivation.
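The post does not publish the scoring algorithm, but the HHI it mentions is a standard concentration measure: the sum of squared market shares. A minimal sketch of that one ingredient is below; how HHI is then combined with structural importance into the 1.0 – 10.0 Contagion Score is not specified in the post and is not attempted here.

```python
def hhi(shares):
    """Herfindahl-Hirschman Index on a 0-1 scale: sum of squared shares.

    `shares` may be in any unit (fractions, percentages, volumes);
    they are normalized to fractions of the total first.
    """
    total = sum(shares)
    fractions = [s / total for s in shares]
    return sum(f * f for f in fractions)


# Illustrative supplier shares, not taken from the dataset.
concentrated = hhi([50, 30, 20])
fragmented = hhi([25, 25, 25, 25])
```

A single supplier gives the maximum value of 1.0, while many equally sized suppliers push the index toward zero.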

Access GitHub Repo

Full disclosure: I am the creator of this project. I’m sharing this as a free open-source intelligence drop for the community to play with. I’d love to get feedback on the edge logic or hear how you’re using the topology in your own models.

submitted by /u/Vast-Village-2596

Open Hantavirus Case Dataset – Aggregated From WHO/CDC/ECDC/PAHO/ProMED, CC-BY-SA JSON API

Sharing a dataset I’ve been maintaining since the MV Hondius hantavirus cluster started in early April.

Aggregated from primary public health sources: WHO Disease Outbreak News, CDC HAN advisories, ECDC bulletins, PAHO weekly reports, ProMED-mail, and national health ministries. Cron pulls every 30 minutes, normalizes case definitions per WHO DON600 framework, geocodes to city or province level where source data permits, dedupes against the archive.

Format: JSON
License: CC-BY-SA 4.0
Endpoint: https://hantaosint.com/api/v1/public.json
Dashboard: https://hantaosint.com
Methodology: https://hantaosint.com/methodology

Fields: case_id, date, country, region, virus_strain, confidence_level (confirmed/suspected/probable/monitoring), source, source_url, lat, lng
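A short sketch of how records with these fields might be consumed, keeping confidence levels separate. The payload shape (a flat JSON array of objects) and every value in the sample are assumptions made up for illustration; check the methodology page for the real structure before relying on this.

```python
import json
from collections import Counter

# Hypothetical payload using the listed field names; all values are placeholders.
SAMPLE_JSON = """
[
  {"case_id": "demo-001", "date": "2000-01-01", "country": "Exampleland",
   "region": "North", "virus_strain": "unspecified",
   "confidence_level": "confirmed", "source": "demo",
   "source_url": "https://example.org/1", "lat": 0.0, "lng": 0.0},
  {"case_id": "demo-002", "date": "2000-01-02", "country": "Exampleland",
   "region": "South", "virus_strain": "unspecified",
   "confidence_level": "suspected", "source": "demo",
   "source_url": "https://example.org/2", "lat": 1.0, "lng": 1.0},
  {"case_id": "demo-003", "date": "2000-01-03", "country": "Otherland",
   "region": "East", "virus_strain": "unspecified",
   "confidence_level": "confirmed", "source": "demo",
   "source_url": "https://example.org/3", "lat": 2.0, "lng": 2.0}
]
"""


def counts_by_confidence(records, country=None):
    """Count records per confidence_level, optionally for one country."""
    selected = (r for r in records if country is None or r["country"] == country)
    return Counter(r["confidence_level"] for r in selected)


records = json.loads(SAMPLE_JSON)
summary = counts_by_confidence(records)
exampleland = counts_by_confidence(records, country="Exampleland")
```

Keeping the counts per level, rather than summing them, preserves the distinction between confirmed and suspected cases in any downstream model.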

Confidence levels are kept separate rather than conflated, which most outbreak trackers don’t bother with. Historical outbreaks included for retrospective analysis: 1993 Four Corners, 2012 Yosemite, 2018-19 Epuyen.

Use cases I built it for: time-series modeling of cluster spread, retrospective comparison of hantavirus outbreaks, surveillance signal for travel medicine research.

Happy to add fields if researchers need additional structure. Open to feedback on the schema and source coverage.

submitted by /u/Professional_Art2346

US Capital Punishment (1999): A Curated Dataset Of Judicial Executions For Criminology And Data Science – Zenodo

https://zenodo.org/records/20130055
This curated dataset provides a comprehensive and high-fidelity record of the judicial executions carried out in the United States during the year 1999, which represents the historical peak of capital punishment activity in the modern era.
https://github.com/lightbluetitan/us-capital-punishment-1999

submitted by /u/renzocrossi

What Publicly Available Recurring Data Source Do You Repeatedly Search For That Still Doesn’t Exist In Clean Structured Format?

I’m researching gaps in publicly available recurring data that people regularly need for analytics, ML, automation, monitoring, or business workflows.

I’m especially interested in data that is technically public but still difficult to use because it is:

  • trapped in PDFs
  • scattered across websites
  • updated inconsistently
  • available only through dashboards
  • difficult to scrape
  • missing historical archives
  • lacking APIs
  • poorly standardized

Examples could include:

  • government notices
  • procurement/tender data
  • financial filings
  • real-estate listings
  • agriculture pricing
  • shipping/logistics updates
  • business registries
  • market prices
  • legal/regulatory documents
  • municipality/city data

submitted by /u/strange1807

What Kind Of Robot Manipulation Datasets Are Teams Actually Looking For Right Now?

I’m trying to understand what robotics and embodied AI teams actually need when collecting real-world training data.

The use cases I keep hearing about are:

  • robotic hand manipulation
  • grasping and pick-and-place
  • soft and fragile object handling
  • tabletop tasks
  • warehouse tasks

For teams working on imitation learning, VLA models, or robot manipulation, what is usually the biggest bottleneck?

  • not enough real-world data
  • task diversity
  • camera and sensor consistency
  • annotation quality
  • hardware-specific data

I work with a small team connected to robotic visual data collection, but I’m mainly trying to understand what teams actually need before going too deep in the wrong direction.

submitted by /u/WideAmbition1964

Tool For Data Ingestion, Transformation, Orchestration, And Analysis [self-promotion]

Disclaimer: I’m a developer advocate at Bruin. I previously worked in data analyst and then data engineering roles for almost 10 years, and now at this job I finally have the freedom to play around with data just for fun. This community has always been my go-to place to find cool datasets.

That’s why I’m excited to share this announcement with you but I promise to keep the promotional talk very minimal.

I’m sure many of you use AI agents to analyze data, build dashboards, and share them with friends and others. Bruin has a lot of open-source tools for data ingestion, transformation, orchestration, and visualization. Today we are announcing the general availability of Bruin Cloud which is the managed service of those free open-source tools.

I’m personally excited because, as a dev advocate, I’ve focused mainly on our open-source tools, but managing and deploying them locally is sometimes an obstacle for someone who just wants to play around with data. The free tier of Bruin Cloud (no payment required) gives you enough credits to get started running your pipelines and, more importantly, to analyze your data using the AI data analyst and dashboard builder.

Check out the open-source tools: https://github.com/bruin-data

If interested, feel free to check Bruin Cloud too: https://cloud.getbruin.com/register

submitted by /u/uncertainschrodinger