Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Central Bank Monetary Policy Dataset – 12 Banks, 5000+ Documents, Sentiment Labels

Released a dataset of central bank communications with NLP sentiment labels. Contents:

  • 12 central banks (Fed, ECB, BOE, BOJ, PBOC, RBA, etc.)
  • Policy statements, minutes, speeches
  • Sentence-level hawkish/dovish/neutral labels
  • Economic indicators (rates, FX, GDP, inflation)

Dashboard: https://monetary.ivan.digital
Hugging Face: https://huggingface.co/datasets/aufklarer/central-bank-communications
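
For anyone who wants to poke at it, here is a minimal loading sketch; the split name and label column are assumptions, so check the dataset card on Hugging Face for the actual schema:

    from datasets import load_dataset

    # Dataset id is from the post; the "train" split and "label" column are assumptions
    ds = load_dataset("aufklarer/central-bank-communications", split="train")
    print(ds.column_names)

    # Example: keep only sentences labelled hawkish (if the label column is named "label")
    hawkish = ds.filter(lambda row: row["label"] == "hawkish")
    print(len(hawkish), "hawkish-labelled rows")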

submitted by /u/ivan_digital

Executive Compensation Dataset Extracted From 100k+ SEC Filings (2005-2022)

I built a pipeline to extract Summary Compensation Tables from SEC DEF-14A proxy statements and turn them into structured JSON.

Each record contains: executive name, title, fiscal year, salary, bonus, stock awards, option awards, non-equity incentive, change in pension, other compensation, and total.
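
A hypothetical record, just to illustrate the shape implied by that field list (the key names and values here are made up; see the HuggingFace sample for the real schema):

    record = {
        "executive_name": "Jane Doe",
        "title": "Chief Executive Officer",
        "fiscal_year": 2021,
        "salary": 1_000_000,
        "bonus": 250_000,
        "stock_awards": 3_500_000,
        "option_awards": 1_200_000,
        "non_equity_incentive": 800_000,
        "change_in_pension": 0,
        "other_compensation": 45_000,
        "total": 6_795_000,
    }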

The pipeline is running on ~100k filings to build a dataset covering all US public companies from 2005 to today. A sample is up on HuggingFace, full dataset coming when processing is done.

GitHub: https://github.com/pierpierpy/Execcomp-AI

HuggingFace sample: https://huggingface.co/datasets/pierjoe/execcomp-ai-sample

submitted by /u/Logical_Delivery8331

Anyone Seeing AI Agents Consume Paid Datasets Yet?

I’m a founder doing some early research and wanted to get a pulse check from folks here.

I’m seeing more AI agents and automated workflows directly calling data APIs (instead of humans or companies manually integrating). It made me wonder whether, over time, agents might become real “buyers” of datasets, paying per use or per request.

Curious how people here are seeing this. Does the idea of agents paying directly for data make sense, or feel unrealistic?

Just trying to understand how dataset creators and sellers are thinking about this shift, or whether it’s too early/overhyped.

Would love to hear any honest takes!

submitted by /u/Shot_Fudge_6195

Compileo – Open Source Data Engineering And Dataset Generation Suite For AI Fine Tuning And Other Applications

**Disclaimer:** I am the developer of the software.

Hello,

I’m a physician-scientist and AI engineer (attempting to combine the two professionally; such opportunities are not easy to find so far). I developed AI-powered clinical note and coding software, but when I attempted to improve outcomes via fine-tuning of LLMs, I became frustrated by the limitations of the open-source data engineering solutions available at the time.

Therefore, I built Compileo: a comprehensive suite to turn raw documents (PDF, DOCX, PowerPoint, web) into high-quality fine-tuning datasets.

**Why Compileo?**
* **Smart Parsing:** Auto-detects if you need cheap OCR or expensive VLM processing and parses documents with complex structures (tables, images, and so on).
* **Advanced Chunking:** 8+ strategies including Semantic, Schema, and **AI-Assist** (let the AI decide how to split your text).
* **Structured Data:** Auto-generate taxonomies and extract context-aware entities.
* **Model Agnostic:** Run locally (Ollama, HF) or on the cloud (Gemini, Grok, GPT). No GPU needed for cloud use.
* **Developer Friendly:** Robust Job Queue, Python/Docker support, and full control via **GUI, CLI, or REST API**.

Includes a 6-step Wizard for quick starts and a plugin system (built-in web scraping & flashcards included) for developers so that Compileo can be expanded with ease.

https://github.com/SunPCSolutions/Compileo

submitted by /u/redyforeddit

Stream Huge Hugging Face And Kaggle Datasets

Greetings. I am trying to train an OCR system on huge datasets, namely:

They contain millions of images, and are all in different formats – WebDataset, zip with folders, etc. I will be experimenting with different hyperparameters locally on my M2 Mac, and then training on a Vast.ai server.

The thing is, I don’t have enough space to fit even one of these datasets at a time on my personal laptop, and I don’t want to use permanent storage on the server, because I want to rent the server for as short a time as possible. If I have to instantiate server instances multiple times (e.g. in case of starting over from scratch), I will waste several hours every time re-downloading the datasets. Therefore, I think that streaming the datasets is a flexible option that would solve my problems both locally on my laptop and on the server.
However, only two of the datasets are available on Hugging Face, and one is only on Kaggle, which I can’t stream from. Furthermore, I expect to hit rate limits when streaming the datasets from Hugging Face.
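
For reference, streaming from Hugging Face avoids the local-disk problem entirely; a minimal sketch (the dataset id is a placeholder, since the original list of datasets isn’t shown here):

    from datasets import load_dataset

    # streaming=True yields samples lazily instead of downloading the whole dataset
    ds = load_dataset("some-org/some-ocr-dataset", split="train", streaming=True)
    for sample in ds.take(5):
        print(sample.keys())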

Having said all of this, I consider just uploading the data to Google Cloud Buckets, and use the Google Cloud Connector for PyTorch to efficiently stream the datasets. This way I get a dataset-agnostic way of streaming the data. The interface directly inherits from PyTorch Dataset:

    # The original snippet uses Config from the map-style module, so import it too
    from dataflux_pytorch import dataflux_iterable_dataset, dataflux_mapstyle_dataset

    PREFIX = "simple-demo-dataset"

    iterable_dataset = dataflux_iterable_dataset.DataFluxIterableDataset(
        project_name=PROJECT_ID,
        bucket_name=BUCKET_NAME,
        config=dataflux_mapstyle_dataset.Config(prefix=PREFIX),
    )

The iterable_dataset now represents an iterable over data samples.
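
A minimal usage sketch under that assumption (exactly what each yielded sample contains depends on the connector’s configuration, so the decode step is a placeholder):

    from torch.utils.data import DataLoader

    # batch_size=None disables automatic batching, so each iteration yields one raw sample
    loader = DataLoader(iterable_dataset, batch_size=None)
    for sample in loader:
        ...  # decode the object (e.g. image bytes plus its paired string) as appropriate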

I have two questions:

  1. Are my assumptions correct, and is it worth uploading everything to Google Cloud Buckets (assuming I pick locations close to my working location and my server location, enable hierarchical storage, use prefixes, etc.)? Or should I just stream the Hugging Face datasets, download the Kaggle dataset, and call it a day?
  2. If uploading everything to Google Cloud Buckets is worth it, how do I store the datasets in GCP Buckets in the first place? This and this tutorial only work with images, not with image-string pairs. (A small upload sketch follows this list.)
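
On question 2, one plausible layout (a sketch only, reusing PROJECT_ID and BUCKET_NAME from the snippet above) is to upload each image together with a sidecar text object under the same key stem, so a prefix scan pairs them back up:

    from google.cloud import storage

    client = storage.Client(project=PROJECT_ID)
    bucket = client.bucket(BUCKET_NAME)

    def upload_pair(image_path: str, label: str, index: int, prefix: str = "ocr-dataset"):
        # image bytes under images/, the paired string under labels/ with the same stem
        bucket.blob(f"{prefix}/images/{index:08d}.png").upload_from_filename(image_path)
        bucket.blob(f"{prefix}/labels/{index:08d}.txt").upload_from_string(label)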

submitted by /u/Suspicious-Pick-7961

Synthetic Infant Detection Dataset (version 2)

Earlier this year, I wrote a path tracing program that randomized a 3D scene of a toddler in a crib, in order to generate synthetic training data for a computer vision model. I posted about it here.

I made this for the DIY infant monitor I made for my son. My wife and I are now about to have our second kid, and consequently I decided to revisit this dataset/model/software and release a version 2.

In this version, I used Stable Diffusion and Midjourney to generate images for training the model. These ended up being way more realistic and diverse. I paid a few hundred dollars to generate over a thousand training images and videos (useful for testing detection + tracking). I labeled them manually with LabelMe. Right now, all images have segmentation masks, but I’m in the middle of adding bounding boxes (and will add keypoints after that, for pose estimation).

To make sure this dataset actually works in practice, I created a “reference model” to train. I tried various backbones, settling on MobileNetV3 (small) with a shallow U-Net detection head. The results were pretty good, and I’m now using it in my DIY infant monitoring system.
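
For anyone curious what such a model roughly looks like, here is a minimal PyTorch sketch (not the author’s actual code; a real U-Net head would also add skip connections from earlier backbone stages):

    import torch
    import torch.nn as nn
    from torchvision.models import mobilenet_v3_small

    class TinySegmenter(nn.Module):
        """Sketch: MobileNetV3-small features plus a shallow per-pixel decoder head."""
        def __init__(self, num_classes: int = 1):
            super().__init__()
            self.backbone = mobilenet_v3_small(weights=None).features  # 576 channels, stride 32
            self.head = nn.Sequential(
                nn.Conv2d(576, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, num_classes, kernel_size=1),
            )

        def forward(self, x):
            logits = self.head(self.backbone(x))
            # upsample back to the input resolution to get a segmentation mask
            return nn.functional.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)

    model = TinySegmenter()
    out = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 1, 224, 224)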

Anyway, you can find the repo here and download the dataset, which is a flat numpy array, on Kaggle.

Cheers!

PS: Just to be clear, I made this dataset, it is synthetic (GenAI), it is not a paid dataset.

submitted by /u/taylorcholberton

How Do You All Do Data Labelling/annotation?

Hi! First – please forgive me if this is a stupid question / solved problem, but I’m sort of new to this space, and curious. How have you all dealt with creating labelled datasets for your use cases?

E.g.:

  • what tool(s) did you use? I’ve looked into a few, like Prolific (not free) and Label Studio (free), plus a few other websites
  • how did you approach recruiting participants/data annotators? e.g. did you work with a company like Outlier, or did you recruit contractors, or maybe you brought them on full-time?
  • Building on that, how did you handle collaboration and consensus if you used multiple annotators for the same row/task? Or, more broadly, quality control? (A minimal majority-vote sketch follows this list.)
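
One common baseline for consensus, assuming each task gets a small odd number of categorical labels, is a simple majority vote with ties flagged for manual adjudication (agreement metrics like Cohen’s or Fleiss’ kappa are the usual next step for quality control):

    from collections import Counter

    def majority_label(labels):
        """Most frequent label for one task; ties return None so a human can adjudicate."""
        counts = Counter(labels).most_common(2)
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            return None
        return counts[0][0]

    print(majority_label(["spam", "spam", "ham"]))  # -> "spam"
    print(majority_label(["spam", "ham"]))          # -> None (tie)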

These seem like hard problems to me… I would appreciate any insight or advice you have from your experiences! Thanks so much!

submitted by /u/Advanced-Park1031

[FREE] 100K+ Domain Technographics (November 2025)

This dataset contains technologies fingerprinted from the headers and bodies of HTTP responses across 100K+ domains. It also includes the IP address that served each HTTP response, its origin country, and its ASN.

https://www.dropbox.com/scl/fi/vr417dfkv8ia2xzil98b2/nov_2025_all_samples.zip?rlkey=7l6nrhvrrjzop2l6d5wgv6bti&e=1&st=fra1zbgo&dl=0

The dataset is compiled from all the samples currently available at: https://versiondb.io

Have fun!

submitted by /u/Upper-Character-6743

Gathering Key Data About Medical Practices

I’m new to data engineering, and I’m currently trying to get website links for medical practices. I have their name, state, specialty, and some other key info about the tech they use, but I don’t think there’s a catch-all dataset with working website links or anything that leads to them. I was thinking of using scraping tools, but I’m not sure whether they are known to be accurate or which one to use. I’m willing to use free or paid approaches; I’m just not sure how to get this data with 80% confidence that it’s accurate.

submitted by /u/Special-Sock968

Struggling To Extract Data From 1,500+ Mixed Scanned/digital PDFs. Tesseract, OCR, And Vision LLMs All Failing. Need Advice.

Hi everyone,

I am working on my thesis and I have a dataset of about 1,500 PDF reports from the DGHS (Health Services). I need to extract specific table rows (District-wise Dengue stats) from them.

The Problem: The PDFs are a nightmare mix. Some are digital with selectable text, but many are low-quality scans or photos of paper reports. The fonts are often garbled (mojibake) when extracted as text, and the layout changes slightly between years.
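
A common first step for a corpus like this, sketched below with pdfplumber (which you already use), is to route each file by whether it has a usable text layer; note that files with garbled legacy-font encodings will still pass this check and need a separate mojibake test:

    import pdfplumber

    def has_text_layer(pdf_path: str, min_chars: int = 50) -> bool:
        """Heuristic router: treat as 'digital' only if page 1 yields enough selectable text."""
        with pdfplumber.open(pdf_path) as pdf:
            text = pdf.pages[0].extract_text() or ""
        return len(text.strip()) >= min_chars

    # digital files -> table extraction; scanned files -> render to image + OCR/VLM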

What I have tried so far (and why it failed):

  1. Tesseract OCR: It struggled hard with the Bengali/English mix and the table borders. The output was mostly noise.
  2. Standard PDF scraping (pdfplumber/PyPDF): Works on the digital files, but returns garbage characters (e.g., Kg‡dvU instead of “Chittagong”) due to bad font encoding in the source files.
  3. Ollama (Llama 3.1 & MiniCPM-V):
    • Llama 3.1 (Text): Hallucinates numbers or crashes when it sees the garbled text.
    • MiniCPM-V (Vision): This was my best bet. I wrote a script to convert pages to images and feed them to the model. It works for about 10 files, but then it starts hallucinating or missing rows entirely, and it’s very slow.

The Goal: I just need to reliably extract the District Name, New Cases, Total Cases, and Deaths for a specific division (Chittagong) into a CSV.

I have attached a screenshot of one of the “bad” scanned pages.

Has anyone successfully processed a mixed-quality dataset like this? Should I be fine-tuning a small model, or is there a specific OCR pipeline (like PaddleOCR or DocumentAI) that handles this better than raw LLMs?
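
For what it’s worth, the classic PaddleOCR (2.x) entry point looks roughly like this; a sketch only, and the Bengali text would need a different recognition language/model than the English default shown here:

    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")   # angle classifier helps with skewed scans
    result = ocr.ocr("page_001.png", cls=True)
    for box, (text, confidence) in result[0] or []:
        print(text, confidence)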

Any pointers would be a lifesaver. I’m drowning in manual data entry right now.

submitted by /u/deletedusssr

How Do You Efficiently Pre-filter And Group WhatsApp Numbers To Boost Engagement?

Hey everyone,

Lately I’ve been playing around with a little workflow for screening WhatsApp numbers. The idea’s pretty simple: figure out which numbers are actually active and get a sense of engagement, without bothering anyone. It’s super handy if you need to quickly group contacts or analyze interaction rates.

I realized that just four fields can filter out around 60% of low-value numbers: number | last_seen | replied | bounce.

I wrote a few simple scripts for pre-filtering, but some steps felt kinda repetitive, so I started using a small tool (TNTwuYou) to handle list validation and reply tracking.

Some things I’ve tried:

  • Sorting numbers by last active date, so you hit the active folks first.
  • Grouping contacts based on reply status.
  • Using simple scripts with data to get a clear picture of which regions or types of people are more likely to engage.

Has anyone done reply probability scoring?

  • Do you base it on a time window or historical reply rate?
  • Anyone tried using graph or clustering methods for grouping contacts?

submitted by /u/Suspicious_Prior4515

Has Anyone Tried Letting AI Agents Access Your Data And Pay Per Request?

I’m curious whether anyone here has actually tried letting AI agents directly access a dataset or data API and pay based on usage (e.g. per request or per query).

I’ve seen ideas around usage-based APIs and agent tool-calling, but I’m not sure how this works in practice when the client isn’t a human. Did it make sense economically? Were abuse, pricing, or access control big issues?

Would love to hear if people have experimented with similar ideas or decided it wasn’t worth trying.

submitted by /u/Shot_Fudge_6195

Open Source Or Cheap Alternative To GICS/ICB Security Industry Sectors

GICS (The Global Industry Classification Standard from MSCI) and ICB (Industry Classification Benchmark from FTSE/LSE/Dow Jones) seem to dominate the securities industry sector data market.

There are alternatives available from players such as ICE, but in all cases they are proprietary and, as far as I can tell, pretty much identical.

11 top level sectors, which are then split into more and more granular sub-categories.

I’m fairly certain that nobody really has any use for the most granular sub-sectors which contain >160 sectors… But the high and mid level classifications would be really useful.

You can theoretically grab sector weightings data from Yahoo Finance by ticker code… But I’d ideally like to be able to use either SEDOL or ISIN to look values up.
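
For reference, the Yahoo Finance route looks roughly like this with the unofficial yfinance package (a sketch; the info keys are not a stable API, and lookups are ticker-keyed rather than SEDOL/ISIN-keyed):

    import yfinance as yf

    info = yf.Ticker("AAPL").info
    # Yahoo's own sector/industry taxonomy, not GICS or ICB
    print(info.get("sector"), "/", info.get("industry"))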

I’m sure there are others who would like something like this, so before I think about trying to create my own gizmo for it, I was wondering: has anybody done anything similar?

submitted by /u/VivicaFromGsyEh

[PAID] I Compiled A Clean JSON Dataset Of All Japanese Prefectures And 1,700+ Cities For Developers [self-promotion]

I’m working on a project that required accurate hierarchical Japanese location data (prefecture → city/ward/town/village).

Since most publicly available datasets were outdated, inconsistent, or missing entries, I compiled a clean version from multiple official sources.

It includes:

  • 1 country
  • 47 prefectures
  • 1,700+ municipalities
  • consistent hierarchical IDs
  • UTF-8, machine-friendly
  • suitable for forms, address validation, GIS, ML, and location-based apps

If anyone is interested, I’m happy to provide details or export it as CSV / SQL.

The full JSON dataset is available here (paid):
https://makotocroco.gumroad.com/l/japan-locations

(self-promotion: this is my own dataset)

submitted by /u/Specialist-Weight407

Dataset Release: Real Structural Engineering Drawings For AI (PNED – 6 RC Datasets)

Hi everyone,

I’ve been working as a structural engineer for about 10 years (Germany, RC design).
Over the last few years I’ve noticed something very surprising in AI/ML:

We have datasets for almost everything — but none for real structural engineering drawings.

These drawings are extremely challenging for machine learning due to:

  • dense, overlapping geometry
  • structural symbols and reinforcement notation
  • dimensions, leaders, section markers
  • multi-layer technical detailing
  • scale-dependent information
  • mixed text + geometry + symbols

Because of this, they are highly relevant for:

  • OCR / document understanding
  • object detection
  • layout analysis
  • symbol recognition
  • segmentation
  • BIM automation
  • engineering-focused CV research

So I started building a series of datasets of real reinforced-concrete drawings, created specifically for ML tasks.

Each dataset contains:

  • 25 PDF engineering drawings (50 for the Columns dataset)
  • 25 PNG images at 1200 dpi (50 for the Columns dataset)
  • one structural category per dataset (RC beams, walls, foundations, columns, precast columns, etc.)

So far I’ve released 6 datasets:

  • RC Beams V1
  • RC Columns V1
  • RC Foundations V1
  • RC Precast Columns V1
  • RC Walls V1
  • RC Walls V2

All datasets, including sample images, can be viewed here:

👉 https://huggingface.co/PNEngineeringDatasets

I’d be happy to hear any feedback, suggestions or use cases you think could be valuable for ML research in this domain.

Disclaimer: this is my own dataset project; posting once for visibility.

submitted by /u/PNEngineeringDataset

Where To Find Monolingual Dictionary Dataset For Multiple Languages

Hello guys. Any idea where I could get a free dataset containing monolingual dictionaries (word-definition pairs in the same language) in multiple languages? I got English from Kaikki (Wiktionary), but it is missing the ‘senses’ for other languages. WordNet might be no good, since I need sensible definitions. I’m considering building it myself from the Wiktionary dumps of different languages, but I thought it might be better to ask first.

submitted by /u/JamesAntoni