Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

My Experience With QuantumProxies ($1/GB Residential Proxy)

Started using QuantumProxies recently after trying a few other providers, and it’s actually been pretty decent so far.

Residential proxies start at $1/GB, which caught my attention first, but the performance has also been reliable. Speeds are stable, setup was simple, and I haven’t really had issues with downtime yet.

Not saying it’s the best provider out there, but for the price it’s been worth it in my experience.

https://quantumproxies.io

submitted by /u/Individual-Waltz6599

We Just Captured 1800+ Human Motion Sequences For AI Model Training. Here’s What 4 Days Of Continuous Motion Capture Looks Like.

Just wrapped a 4-day motion capture dataset shoot at our studio in India. Wanted to share some behind-the-scenes since motion data is becoming increasingly critical for humanoid robot training and imitation learning.

What we did:

  • 12 actors
  • Continuous day + night shooting
  • Structured locomotion and action datasets
  • High-volume capture (1800+ sequences)
  • 24-hour production cycles to meet deadline

What’s interesting about this:

Most AI/ML teams working on humanoid control or embodied AI are stuck with either:

  1. Low-quality synthetic data
  2. Academic datasets that don’t scale
  3. Building their own infrastructure (expensive)

We realized professional motion capture studios have the infrastructure already built. So we’re now offering this as a service specifically for ML teams.

The dataset we captured is structured for imitation learning — actions, locomotion, complex movements. Not cinematic. Not game-ready. Built specifically for training.

If you’re working on humanoid robotics, gesture recognition, or motion-based ML models and need real human movement data, this is now available as a service.

More details: www.appleartsstudios.com

Happy to answer questions about dataset format, motion capture quality, or scaling.

submitted by /u/PossiblePotato961

I Analyzed 2,300+ UK Dental Clinics — Most Are Missing This

I analyzed 2,300+ UK dental practices and found something surprising:

– ~55% don’t have a Meta Pixel installed

– Many still rely on outdated or no booking systems

– Tracking and attribution are almost nonexistent

Meaning: a huge number of clinics are not ready for proper paid ads or funnel optimization.

I mapped emails, phones, and tech stack (GA, CMS, booking systems) across 80+ cities.
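
For anyone wanting to reproduce a check like the Meta Pixel one, here is a minimal sketch. It assumes the standard pixel snippet, which loads `fbevents.js` from `connect.facebook.net` and calls `fbq('init', ...)`. The function names are hypothetical, and this is not necessarily how the numbers above were produced.

```python
import re

# Markers typical of the standard Meta Pixel base snippet. Assumption:
# sites that inject the pixel through a tag manager will not show these
# strings in their static HTML, so this undercounts.
PIXEL_PATTERNS = [
    re.compile(r"connect\.facebook\.net/[^\"']*fbevents\.js", re.I),
    re.compile(r"\bfbq\(\s*['\"]init['\"]", re.I),
]

def has_meta_pixel(html: str) -> bool:
    """Return True if the page source contains a Meta Pixel marker."""
    return any(p.search(html) for p in PIXEL_PATTERNS)

def pixel_coverage(pages: dict[str, str]) -> float:
    """Share of pages (URL -> HTML) with a detectable pixel."""
    if not pages:
        return 0.0
    return sum(has_meta_pixel(h) for h in pages.values()) / len(pages)
```

Fetching the HTML is left out on purpose; any HTTP client can feed `pixel_coverage` a mapping of URL to page source.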

If you’re working in dental marketing, SaaS, or lead gen — how would you use this kind of data?

Curious to hear ideas. Happy to share a small sample if useful.

submitted by /u/RowStunning5177

Open Source Tool For Generating And Cleaning Synthetic Instruction-tuning Datasets

Built this because I wanted a reproducible way to build fine-tuning datasets without doing it all by hand.

You give it seed prompts or an existing dataset, it generates instruction-output pairs via any OpenRouter model, scores them with a local or remote LLM judge, and exports a clean JSONL you can use directly for training.

You can also ingest datasets straight from HuggingFace and filter or relabel them through the same pipeline.

The export step lets you set a score threshold and a train/val split ratio so what comes out is ready to use.
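
To illustrate what an export step like this might do, here is a rough sketch. It is not the tool's actual code: the field names (`instruction`, `output`, `score`) and both function names are assumptions made for the example.

```python
import json
import random

def export_split(records, min_score, val_ratio, seed=0):
    """Keep records scoring at least min_score, then split train/val."""
    kept = [r for r in records if r["score"] >= min_score]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_val = int(len(kept) * val_ratio)
    return kept[n_val:], kept[:n_val]  # (train, val)

def to_jsonl(records):
    """Serialize records as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Filtering before splitting means the validation set only ever contains examples that passed the judge threshold.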

MIT licensed, everything is stored locally, no data leaves your machine unless you choose a cloud judge backend.

Github project link is in comments below 👇

submitted by /u/gvij

[Dataset] [self-promotion] Curated Brain Regeneration Research Dataset: 44,500+ Papers + 18,800+ Clinical Trials Across 19 Sources, Organized By Expert Research Team, Open API

What it is

Brain-Regeneration.com is an open observatory tracking the science of brain repair and neurodegeneration. The dataset behind it aggregates papers and clinical trials across 19 sources — including PubMed, bioRxiv, medRxiv, The Lancet, Nature, PNAS, WHO trial search, ClinicalTrials.gov, and the EU Clinical Trials Register.

Current counts:

  • 44,510 papers
  • 18,883 clinical trials
  • 226,850 authors indexed

What makes it different from a PubMed export

The data is organized by expert research teams (groups at Cambridge, the University of Coimbra, and iMed.ULisboa), which gives you a built-in faceting dimension for slicing the corpus. Each team has its own endpoint, so you can query by research group rather than just keyword.

The API

Public and open, no auth required:

Possible use cases

  • Training or benchmarking domain-specific NLP models on a high-signal neuroscience corpus
  • Mapping research activity timelines against clinical trial registration patterns
  • Citation and author network analysis within a curated subfield

Full API docs at https://github.com/brunoamaral/gregory-ai/blob/main/docs/03-api-and-rss-feeds.md . Happy to answer questions about the data structure or coverage.

submitted by /u/brunoamaral

Finding The Full Multi-PIE Dataset (face Pictures)

There is a dataset called “Multi-PIE” that I’m trying to find, but I only have some vague references.

How can I obtain the full dataset?

submitted by /u/GJani

Looking For Emergency Triage Dataset With Chief Complaint Text + Vitals

I’m looking for an open/public dataset with columns like:

  • Chief complaint / symptoms / reason for visit
  • Age and gender
  • Heart rate
  • Blood pressure
  • SpO2 / oxygen saturation
  • Temperature
  • Respiratory rate
  • Pain score
  • Triage level / acuity / severity label
  • Diagnosis or discharge outcome, if available
  • Department/speciality label, if available

I already know about MIMIC-IV-ED, but it requires PhysioNet credentialing and CITI training, so I’m looking for easier-to-access Kaggle or public alternatives.

Any dataset suggestions would be appreciated.

Thanks!

submitted by /u/Serious_Ad_5036

PiC/phrase_retrieval Dataset (PR-pass & PR-page) Is Broken — Does Anyone Have A Local Copy?

Hey everyone,

I’ve been trying to use the PiC (Phrase-in-Context) Phrase Retrieval dataset from HuggingFace (`PiC/phrase_retrieval`, configs: PR-pass and PR-page), but the loader is broken because the underlying data files hosted at `auburn.edu/~tmp0038/PiC/` are returning a ‘403 Forbidden’ error.

The HuggingFace dataset loader depends entirely on that external Auburn University server, so the dataset is currently unusable for anyone trying to load it programmatically.

I’ve already reached out to the authors (Thang Pham and Anh Tran), but unfortunately have not received a positive response yet.

If anyone downloaded this dataset before the server went down and has the raw JSON files (`train-v1.0.json`, `dev-v1.0.json`, `test-v1.0.json`) for either PR-pass or PR-page, I would really appreciate it if you could share them. I’m also happy to re-host the files on HuggingFace properly once recovered, so the community doesn’t run into this again.

Thanks in advance!

submitted by /u/BugSolid3436

[OC] Usenet Corpus 1980–2013 — 103B Tokens, 408M Posts, 9 Hierarchies, Fully Processed

Shared this on r/MachineLearning a few days ago and got good discussion (30K views, 100+ upvotes) — figured this community would want to know about it too since it’s more directly relevant here.

I’ve spent the last several years building and processing a complete Usenet corpus and finally have it documented well enough to share properly.

What it is: A deduplicated, sanitized collection of Usenet posts from 1980 through 2013 — covering the full arc of Usenet from its academic origins through peak adoption to decline. Pre-web, pre-social media, pre-AI. Entirely human-generated.

Stats:

  • 103.1 billion tokens (cl100k_base)
  • 408,236,288 posts
  • 18,347 newsgroups
  • 9 top-level hierarchies: alt, rec, comp, soc, sci, misc, news, talk, humanities

Processing applied:

  • alt.binaries.* excluded entirely at hierarchy level (UUencoded/base64 binary content)
  • Adult content newsgroups excluded at hierarchy level
  • Record-level: deduplication by Message-ID, binary detection and removal, PII redaction (email addresses replaced with [email] token, Message-IDs SHA-256 hashed), sensitive content removal
  • Language detection on every record (fasttext LID-176) — 96.6% English, 100+ languages total
  • Format: gzip-compressed JSONL, ~141GB compressed
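
As a rough illustration of the record-level steps above (deduplication by Message-ID, email redaction, SHA-256 hashing of IDs), here is a minimal sketch. It is not the actual pipeline: the input field name `message_id`, the function names, and the email regex are assumptions for the example.

```python
import hashlib
import re

# Deliberately simple email pattern; real-world redaction needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def hash_message_id(message_id: str) -> str:
    """Replace a raw Message-ID with a prefixed SHA-256 hex digest."""
    return "msg-" + hashlib.sha256(message_id.encode("utf-8")).hexdigest()

def clean_posts(posts):
    """Yield deduplicated posts with emails redacted and IDs hashed."""
    seen = set()
    for post in posts:
        hashed = hash_message_id(post["message_id"])
        if hashed in seen:  # same Message-ID already emitted
            continue
        seen.add(hashed)
        yield {"text": EMAIL_RE.sub("[email]", post["text"]), "id": hashed}
```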

Schema:

{
  "text": "post body",
  "group": "comp.lang.python",
  "date": "1995-03-14",
  "subject": "Re: thread subject",
  "author": "Display Name",
  "id": "msg-<sha256hex>"
}
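
Given that schema and the gzip-compressed JSONL format, reading a file could look roughly like the sketch below. The function names are made up for illustration, and it assumes one JSON object per line with `date` in `YYYY-MM-DD` form, as in the schema example.

```python
import gzip
import json
from collections import Counter

def iter_posts(path):
    """Stream records from a gzip-compressed JSONL file, one per line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

def posts_per_year(path, group_prefix=""):
    """Count posts per year, optionally limited to a newsgroup prefix."""
    counts = Counter()
    for post in iter_posts(path):
        if post["group"].startswith(group_prefix):
            counts[post["date"][:4]] += 1
    return counts
```

Streaming line by line keeps memory flat, which matters for files of this size; for example, `posts_per_year(path, "comp.")` tallies only the comp hierarchy.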

Samples: 11 sample files (5K posts per hierarchy + combined sets) are freely available — no approval needed. Full corpus available for licensing.

Dataset has also been added to the AI datasets directory at lifearchitect.ai/datasets-table.

Link in comments.

submitted by /u/OwnerByDane

Best Way To Clean GitHub Data (Remove node_modules, Lockfiles, Etc.) For LLM Fine-tuning?

Anyone else wasting hours cleaning GitHub data for LLM fine-tuning?

I tried building my own dataset (instead of relying on Hugging Face), but scraping repos is messy: node_modules, lockfiles, minified code, binaries… tons of junk.

Feels like more time goes into cleaning than actual training.

Curious how you’re handling this:

  • Custom scripts?
  • Existing tools?
  • Or just manual cleanup?

Also, how are you structuring data for different LLM formats?

Thinking about building something to automate this if it’s a common problem.

Would love to hear about the workflows you use.

submitted by /u/Ok_Rub3312

Why Doesn’t Anyone Share Full Historical Tick Data Here? I Have L3 OrderFlow Financial Data

I don’t want to pay for historical tick data; I’d prefer free historical tick data for all available assets.
– The question: why does no one ever share the data they paid for? Obviously, I’m aware it’s because people pay a lot of money for such historical tick data.

That being said, I am willing to give up my gatekeeping, but only if this thread produces links to free historical tick data that was originally paid for (it must be clean, for research purposes).

– I will publicly share a link to a website that openly shares Level 3 OrderFlow data going back to February 2026 (yes, Level 3 OrderFlow that is normally locked to institutions with deep pockets).
– The website houses 1-minute options data from polygon.io.
– Historical economic calendar data for 20 countries, 125,368 events spanning 2015 to 2026. Includes event names, actual vs consensus vs previous values, and period information.
– Complete fundamental data for 984 companies. Each download includes the company profile, income statements, balance sheets, cash flows, key metrics, financial ratios, growth rates, earnings calendar, and insider trades.
– OHLCV candle data across 25,008 datasets: 4,168 symbols in 6 timeframes from 1 minute to daily. One CSV.gz file per year per timeframe per symbol.

But it does not have tick data.

submitted by /u/liquidatedis

I Got Tired Of Checking Kaggle, HuggingFace, Data.gov, And Other Sites Every Time I Needed A Dataset, So I Built A Tool That Searches All Of Them At Once

Disclosure: I’m one of the creators of this tool.

Hi all,

I do ML research at Berkeley and the most tedious part of every project is dataset discovery. I’d spend hours opening tabs across Kaggle, HuggingFace, data.gov, Census, WHO, Semantic Scholar, and a dozen other platforms just to find the right data. Then I’d have to manually check licenses, preview columns, and figure out citations.

So my friend and I built Mobus, an open-source MCP server that lets you do all of that from inside Claude or Cursor. You describe what you need in natural language and it searches across 20 platforms, lets you preview the actual data, checks licenses, and generates citations.

It’s free and open source: https://github.com/mobus-ai/Mobus

Quick demo on the site if you want to see it in action: https://mobus.ai

Would love feedback from anyone who deals with this pain point. What data sources are missing that you’d want to see added?

submitted by /u/Swimming_Outside_988

Seeking A Dataset Of English Lemmas With Recognizability Scores

I checked out the word prevalence dataset of 62,000 lemmas. But it has some limitations:

  • It hasn’t been updated since 2019.

  • It misses modern terms like TikTok.

  • It doesn’t cover phrases.

I’ve scored about a million English entries from Wiktionary for recognizability. I built this for a pun tool. But I want to use the data for a new language project.

The dataset is too bloated because it’s full of inflected forms. Even if I set the recognizability threshold at 50 percent, I’m still looking at 100K words and 100K phrases. Going through a list that size is a waste of time. I need to filter the data through the English lemmas category from Wiktionary and split the single words from the multi-word phrases into separate lists.
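
The filtering and splitting step could be sketched like this. The names are hypothetical, and it assumes scores are stored as fractions in [0, 1] and that the lemma category has already been loaded into a set.

```python
def split_lemmas(scores, lemmas, threshold=0.5):
    """Keep lemmas at or above the threshold; split words from phrases.

    scores: dict mapping an entry to a recognizability score in [0, 1]
    lemmas: set of entries from the lemma category (inflected forms absent)
    """
    words, phrases = [], []
    for entry, score in scores.items():
        if entry not in lemmas or score < threshold:
            continue
        # Treat any entry containing a space as a multi-word phrase;
        # hyphenated entries therefore count as single words.
        (phrases if " " in entry else words).append(entry)
    return sorted(words), sorted(phrases)
```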

Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.

Before I spin up a separate repository to handle this, I’m checking if a similar dataset already exists. Has anyone seen a project that offers this?

submitted by /u/8ta4

Parallelogram – A Strict Linter For LLM Fine-tuning Datasets (catches Broken Data Before Your GPU Run Starts)

Fine-tuning frameworks assume your data is correctly formatted. None of them enforce it. The result is broken training runs discovered after the compute is spent.

Parallelogram is a CLI tool that validates fine-tuning datasets before any training starts. It applies strict hard-blocks on role sequence errors, empty turns, context window violations, duplicates, and mojibake. It exits 0 on clean data and 1 on errors, so it is CI/CD friendly.
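
To make the kinds of checks concrete, here is a small sketch of a validator for chat-format JSONL. It is not Parallelogram's code or API: the function names and the assumed record shape (`{"messages": [{"role": ..., "content": ...}]}`) are illustrative only. It covers role order, empty turns, and exact duplicates, and leaves out context window and mojibake checks.

```python
import json

def lint_conversation(messages):
    """Return a list of error strings for one chat-format example."""
    errors = []
    # System messages are skipped; remaining turns must alternate user/assistant.
    turns = [m for m in messages if m.get("role") != "system"]
    for i, msg in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            errors.append(f"turn {i}: expected {expected}, got {msg.get('role')}")
        if not str(msg.get("content") or "").strip():
            errors.append(f"turn {i}: empty content")
    return errors

def lint_dataset(lines):
    """Lint JSONL lines; return (line_number, error) pairs."""
    problems, seen = [], set()
    for n, line in enumerate(lines, start=1):
        try:
            messages = json.loads(line)["messages"]
        except (ValueError, KeyError, TypeError):
            problems.append((n, "not a JSON object with a 'messages' key"))
            continue
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append((n, "'messages' must be a list of objects"))
            continue
        key = json.dumps(messages, sort_keys=True)
        if key in seen:
            problems.append((n, "duplicate example"))
        seen.add(key)
        problems.extend((n, e) for e in lint_conversation(messages))
    return problems

def exit_code(problems):
    """Map lint results to a CI-friendly status: 0 if clean, else 1."""
    return 1 if problems else 0
```

In a CI job, a wrapper script would pass the file's lines to `lint_dataset` and call `sys.exit(exit_code(...))` so the pipeline stops before any GPU time is booked.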

Apache 2.0, local-first, zero network calls.

Looking for feedback on edge cases people have hit in real fine-tuning workflows. I’d love for you to try it out.

submitted by /u/Quiet-Nerd-5786

Built A CLI To Clean GitHub Repos Into LLM Training Data. Is Manual Cleanup A Real Bottleneck For You?

Every time I try to use GitHub repos for LLM training, I lose hours cleaning out junk: .git files, lock files, minified JS, generated code, and binaries mixed in with real source.

Public datasets like The Stack are great for general pretraining. But if you want a model on a specific stack or curated repos, you end up building the dataset (and the cleanup pipeline) yourself. So I built a CLI tool called RepoCurator to make that step reusable.

What it does:

– Clones repos (shallow for speed)

– Filters noise using rules

– Scores files (0.0–1.0) based on usefulness

– Exports clean, per-file datasets (JSON/TXT)
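
For a sense of what rule-based noise filtering can look like, here is a minimal sketch of that one step. It is not RepoCurator's implementation: the rule sets, the line-length threshold, and the function names are assumptions chosen for the example, and scoring and export are left out.

```python
from pathlib import PurePosixPath

# Illustrative rule sets; a real tool would make these configurable.
SKIP_DIRS = {".git", "node_modules", "dist", "build", "vendor", "__pycache__"}
LOCKFILES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml",
             "poetry.lock", "Cargo.lock", "Gemfile.lock", "composer.lock"}
BINARY_EXTS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".exe", ".so", ".woff"}

def is_noise(path: str) -> bool:
    """Path-based rules: vendored dirs, lockfiles, minified or binary files."""
    p = PurePosixPath(path)
    if SKIP_DIRS.intersection(p.parts[:-1]):
        return True
    if p.name in LOCKFILES or p.suffix.lower() in BINARY_EXTS:
        return True
    return p.name.endswith((".min.js", ".min.css"))

def looks_minified(text: str, max_avg_line_len: int = 200) -> bool:
    """Content heuristic: a very long average line suggests minified code."""
    lines = text.splitlines() or [""]
    return sum(map(len, lines)) / len(lines) > max_avg_line_len

def keep_file(path: str, text: str) -> bool:
    """Keep a file only if it passes both the path and content checks."""
    return not is_noise(path) and not looks_minified(text)
```

Checking the path first is cheap and lets you skip reading most junk files at all; the content heuristic catches minified bundles that lack a `.min` suffix.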

It’s still early, and I’m trying to validate whether this is a real problem. If it resonates, register interest on the page. It helps me decide whether to keep building.

Question:

How are you currently cleaning repos before using them for training or analysis?

submitted by /u/Ok_Rub3312

Prompt Generator/text Generator For Image Generation

Hello fellow developers and analysts, I’m working on a project that will be using image generator models to generate thousands of images.

I have been tasked to find a text or prompt generator model or models to use with the image generators.

So for each image that is created a different prompt needs to be used.

If I run these for 2 days to create images, the prompts also need to change.

If anybody has any suggestions or can point me in the right direction that would be great.

We will be adding the models to our instance and using them from there.

Any help would be appreciated.

submitted by /u/Junior_Wheel1690