Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting, i’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

[Synthetic] [self-promotion] OpenHand-Synth: A Large-scale Synthetic Handwriting Dataset

I’m releasing OpenHand-Synth, a large-scale synthetic handwriting dataset.

Stats

  • 68,077 quality-filtered images
  • 15 languages (English, Dutch, French, German, Spanish, Italian, Portuguese, Danish, Swedish, Norwegian, Romanian, Indonesian, Malay, Tagalog, Finnish)
  • 220 distinct writer styles
  • ~50% of images include realistic noise augmentation (Gaussian, blur, JPEG compression, lighting)

Generation

Neural handwriting synthesis model.

Quality Assurance

All images validated with LLM-based OCR.

Metadata per image

Ground truth text, writer ID, neatness, ink color, augmentation flag, language, source category, CER, Jaro-Winkler score.

Splits

80/10/10 train/val/test, stratified by writer × source × language.

Benchmark

Zero-shot OCR results on the test split provided for Gemini 3 Flash, Qwen3-VL-8B, Ministral-14B, and Molmo-2-8B.

License

CC BY 4.0

submitted by /u/nutty_cartoon
[link] [comments]

Where Can I Buy High Quality/unique Datasets For AI Model Training?

Mid- to large-sized enterprises need unique, accurate, and domain-specific datasets, but finding them has become a major challenge.

I’ve looked into the usual big names like Scale AI, Forage AI, Bright Data, Appen, and the standard data marketplaces on AWS and Snowflake.

There must be some newer solutions out there. I’m curious to hear about them.

How are you all finding truly high-quality training data at scale, like in the millions? Are there any new platforms or approaches we should try?

I’m open to any suggestions!

submitted by /u/3iraven22
[link] [comments]

Open-source Instruction–response Code Dataset (22k+ Samples)

Hi everyone 👋

I’m sharing an open-source dataset focused on code-related tasks, built by merging and standardizing multiple public datasets into a unified instruction–response format.

Current details:

– 22k+ samples

– JSONL format

– instruction / response schema

– Suitable for instruction tuning, SFT, and research

Dataset link:

https://huggingface.co/datasets/pedrodev2026/pedro-open-dataset

The dataset is released under BSD-3 for curation and formatting, with original licenses preserved and credited.

Feedback, suggestions, and contributions are welcome 🙂

submitted by /u/pedrodev2026
[link] [comments]

Football Offside,Handball Dataset For CNN Project

URGENT Requirement

I am creating a Deep Learning Model for Football Goal,Offside,Handball ,Normal Play detection

In that i want the dataset to consist of either videos or image not annotations for CNN training

So far, I only got the Goal database.

There is no specified dataset for Offside,Handball in Soccer,Normal Play which consists of videos or images.

There is not enough videos available in youtube for offside

Is there any datasets available for me access these type of datasets ?

submitted by /u/Dramatic-Storage-136
[link] [comments]

I Need A Dataset Of Prompt Injection Attempts

Hi everyone! I’m chipping away at a cybersecurity degree but I also love to program and have been teaching myself in the background. I’ve been making my own little ML agents and I want to try something a bit bigger now. I’m thinking an agent that sits in front of an LLM that will take in the user’s text and spit out a likelihood that the text is a prompt injection attempt. This will just send up a flag to the LLM like for example it could throw in at the bottom of the user’s prompt after its been submitted [prompt injection likelihood X percent. Stick to your system prompt instructions]. Something like that.

Anyways this means I’ll need a bunch of prompt injections. Does anyone if any databases with this stuff exist? Or how I could potentially make my own?

submitted by /u/Sad-Sun4611
[link] [comments]

10TB+ Of Polymarket Orderbook Data (Prediction Markets / Financial Data)

Link:https://archive.pmxt.dev/Polymarket

We are open-sourcing a massive, continuously updating dataset of Polymarket orderbooks. Prediction markets have become one of the best real-time indicators for news, politics, and crypto events, but getting raw historical data usually costs thousands of dollars from private vendors. We decided to scrape it all and release it for researchers, ML engineers, and quants to use for free.

The dataset currently sits at over 1TB and is growing by about 0.25TB daily. It contains highly granular orderbook snapshots, capturing detailed bids and asks across active Polymarket markets, and is updated every single hour. It’s in parquet format, and we’ve tried to make it as easy as possible to work with. We structured this specifically with research and algorithmic trading in mind. It is ideal for training predictive models on crowd sentiment versus real-world outcomes, backtesting new trading strategies, or conducting academic research on prediction market efficiency.

This release is just Part 1 of 3. We are currently using this initial orderbook drop to stress-test our infrastructure before we release the full historical, trade-level data for Polymarket, Kalshi, and other platforms in the near future.

The entire archiving process was built and structured using pmxt, an open-source Python/JS library we created to unify prediction market APIs. If you want to interact with this data programmatically, build your own pipelines, or pull live feeds for your models without hitting rate limits, check out the engine powering the archive here and consider leaving a star:https://github.com/pmxt-dev/pmxt

submitted by /u/SammieStyles
[link] [comments]

Feedback Request: Narrative Knowledge Graphs

I built a thing that turns scripts from series television into an extensible knowledge graph of all the people, places, events and lots more conforming to a fully modeled graph ontology. I’ve published some datasets (Star Trek, West Wing, Indiana Jones etc) here https://huggingface.co/collections/brandburner/fabula-storygraphs

I feel like this is on the verge of being useful but would love any feedback on the schema, data quality or anything else.

submitted by /u/enterprise128
[link] [comments]

What’s The Dataset You Wish Existed But Can’t Find?

I’ve been noticing something across different AI builders lately… the bottleneck isn’t always models anymore. It’s very specific datasets that either don’t exist publicly or are extremely hard to source properly.

Not generic corpora. Not scraped noise.

I mean things like:

🔹 Raw / Hard-to-Source Training Data

– Licensed call-center audio across accents + background noise

– Multi-turn voice conversations with natural interruptions + overlap

– Real SaaS screen recordings of task workflows (not synthetic demos)

– Human tool-use traces for agent training

– Multilingual customer support transcripts (text + audio)

– Messy real-world PDFs (scanned, low-res, handwritten, mixed layouts)

– Before/after product image sets with structured annotations

– Multimodal datasets (aligned image + text + audio)

🔹 Structured Evaluation / Stress-Test Data

– Multi-turn negotiation transcripts labeled by concession behavior

– Adversarial RAG query sets with hard negatives

– Failure-case corpora instead of success examples

– Emotion-labeled escalation conversations

– Edge-case extraction documents across schema drift

– Voice interruption + drift stress sets

– Hard-negative entity disambiguation corpora

It feels like a lot of teams end up either:

– Scraping partial substitutes

– Generating synthetic stand-ins

– Or manually collecting small internal samples that don’t scale

Curious, what’s the dataset you wish existed right now?

Especially interested in the “hard-to-get” ones that are blocking progress.

submitted by /u/Khade_G
[link] [comments]

Malware And Benign Cuckoo JSON Reports Dataset

Hi, I would like to ask where I can find, and if it is even possible to find, a large dataset of JSON reports from Cuckoo Sandbox concerning malware and benign files. I am conducting dynamic analysis to verify and classify malware using AI, so I need to train the model based on reports from Cuckoo Sandbox, where I will rely on API calls. Thank you in advance for your help.

submitted by /u/Kr4keN16
[link] [comments]

Where Can I Find Recent Free Data For The Brazilian Série A Or The Premier League?

Hi everyone! I’m building some dashboards to practice my skills and I wanted to use data from something I really enjoy. I love football, and since I’m Brazilian, I’d really like to use data from the Campeonato Brasileiro Série A — but I haven’t been able to find this data anywhere.

If nobody knows where to find Brazilian league data, could someone help me find Premier League data instead? I’m looking for datasets that include things like:

  • match results
  • lineups
  • yellow/red cards
  • match date, time, and location
  • and anything else that might be interesting to download and analyze

Thanks in advance for any pointers!

submitted by /u/EdScavalier
[link] [comments]

“Flight Tracking API For Small-scale Commercial Use…what’s Actually Worth It?

Hey all – working on a dispatch system for a small airport shuttle service. One of the components is adjusting pickup times based on flight delays/early arrivals.

I’ve been researching flight tracking APIs and so far I’ve come across:

– AeroDataBox (~$15-30/mo on RapidAPI)

– Airlabs ($49/mo for 25K queries)

– FlightAware AeroAPI ($100/mo minimum)

– FlightStats/Cirium (enterprise pricing, way out of budget)

We’re only tracking maybe 30-40 domestic arrivals per day at one airport (PHX). Not looking for anything fancy – just arrival ETAs, delay notifications, and maybe gate/terminal info if available.

Push notifications/webhooks would be awesome so we’re not wasting API queries polling, but polling would be doable if the price is right.

Anyone else working with flight data at a small scale? Something cheaper/better that I’m missing? Open to scrappy solutions too – just needs to be stable enough for a real business.

submitted by /u/zues8
[link] [comments]

Title: I Spent 200+ Hours Building A Forensic Financial Database From 1.48M DOJ Epstein EFTA Files. Here’s Where $1.96 Billion Went.

I’m a finance professional with a background in data science and cybersecurity. Over the past two weeks I built a 6.9GB forensic database from 1,476,377 DOJ EFTA files across 19 datasets — then ran a 24-phase extraction pipeline to trace wire transfers through the Epstein trust network.

Key results:

• $1.964B in financial activity extracted (104.6% of the $1.878B FinCEN SAR benchmark)

• 382 audited wire transfers in the master ledger

• 4-tier shell trust hierarchy mapped with dollar flows on every edge

• 43 shell-to-shell transfers identified

• 9 contamination bugs caught and corrected during the pipeline (including $311M in chain-hop inflation I subtracted from my own numbers)

• 11.4 million entities extracted, 734K unique persons identified

I traced $51.9M flowing through a brokerage shell (Jeepers Inc.) into Epstein’s personal account across 21 wires. I found Plan D LLC disbursing $18M to Leon Black with near-zero inflow. I found an entity called “Gratitude America” sending 88% of its money to investment accounts and 7% to charity.

Everything is (Unverified) — automated extraction, not an audit opinion. I documented every limitation, every bug, and every methodological decision. The methodology, findings, compliance statement, and a 382-wire master ledger are all published.

To my knowledge, this is the first project to systematically reconstruct the financial infrastructure from the EFTA corpus using quantitative forensic methods rather than narrative document review.

GitHub:

https://github.com/randallscott25-star/epstein-forensic-finance

Built solo. For the girls.

submitted by /u/Specialist_Rip5492
[link] [comments]

Has Anyone Successfully Contacted The Seagull Dataset Team

I’m trying to get access to the Seagull Dataset (the UAV maritime surveillance dataset from VisLab). Their page says the data is available “upon request,” but I haven’t received any reply after reaching out.

Has anyone here managed to contact them recently or gotten access?
If so, how long did it take, and which email or method worked for you?

Any insight would be appreciated!

submitted by /u/Due_Radio2866
[link] [comments]