Category: Datatards

Here you can observe the biggest nerds in the world in their natural habitat, longing for data sets. Not that it isn’t interesting; I’m interested. Maybe they know where the chix are. But what do they need it for? World domination?

Best Way To Clean GitHub Data (remove Node_modules, Lockfiles, Etc) For LLM Fine-tuning?

Anyone else wasting hours cleaning GitHub data for LLM fine-tuning?

I tried building my own dataset (instead of relying on Hugging Face), but scraping repos is messy: node_modules, lockfiles, minified code, binaries… tons of junk.

Feels like more time goes into cleaning than actual training.

Curious how you’re handling this:

custom scripts?

existing tools?

or just manual cleanup?

Also how are you structuring data for different LLM formats?

Thinking about building something to automate this if it’s a common problem..

Would love to hear the workflows you all use.
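For the custom-scripts option, the core can be as small as a path filter. A minimal sketch; the junk lists below are illustrative assumptions, not a complete ruleset:

```python
from pathlib import Path

# Illustrative junk lists -- assumptions, not a complete ruleset.
JUNK_DIRS = {"node_modules", ".git", "dist", "build", "vendor", "__pycache__"}
JUNK_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
JUNK_SUFFIXES = (".lock", ".min.js", ".min.css", ".png", ".jpg", ".gz", ".zip")

def is_junk(path: str) -> bool:
    p = Path(path)
    if JUNK_DIRS.intersection(p.parts):    # junk dir anywhere in the chain
        return True
    if p.name in JUNK_NAMES:               # well-known lockfiles
        return True
    return p.name.endswith(JUNK_SUFFIXES)  # minified/binary by extension

# Keep only plausible source files from a walked repo listing.
files = ["src/app.ts", "node_modules/react/index.js", "yarn.lock", "lib/util.min.js"]
kept = [f for f in files if not is_junk(f)]
```

A pass like this won’t catch generated-but-unminified code, so pairing it with a line-length or entropy heuristic helps.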

submitted by /u/Ok_Rub3312
[link] [comments]

PiC/phrase_retrieval Dataset (PR-pass & PR-page) Is Broken — Does Anyone Have A Local Copy?

Hey everyone,

I’ve been trying to use the PiC (Phrase-in-Context) Phrase Retrieval dataset from HuggingFace (`PiC/phrase_retrieval`, configs: PR-pass and PR-page), but the loader is broken because the underlying data files hosted at `auburn.edu/~tmp0038/PiC/` are returning a 403 Forbidden error.

The HuggingFace dataset loader depends entirely on that external Auburn University server, so the dataset is currently unusable for anyone trying to load it programmatically.

I’ve already reached out to the authors (Thang Pham and Anh), but unfortunately have had no positive response yet.

If anyone downloaded this dataset before the server went down and still has the raw JSON files (`train-v1.0.json`, `dev-v1.0.json`, `test-v1.0.json`) for either PR-pass or PR-page, I would really appreciate it if you could share them.
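If someone does have the raw files, a direct local load sidesteps the broken loader entirely. A sketch; the `{"data": [...]}` shape and the `phrase` key are stand-ins, since I’m not assuming the exact schema — inspect the real top-level keys first:

```python
import json
import os
import tempfile

def load_pic_split(path):
    # Read one raw PiC JSON file straight from disk, bypassing the
    # HF loader (and its dead auburn.edu dependency).
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo with a stand-in file, since train-v1.0.json itself is unavailable.
stand_in = {"data": [{"phrase": "example phrase", "context": "..."}]}
path = os.path.join(tempfile.mkdtemp(), "train-v1.0.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(stand_in, f)

split = load_pic_split(path)
```

Once the files load, `datasets.load_dataset("json", data_files=...)` can wrap them back into a Dataset object if your pipeline expects one.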

Thanks in advance!

submitted by /u/BugSolid3436
[link] [comments]

Why Doesn’t Anyone Share Full Historical Tick Data Here? I Have L3 OrderFlow Financial Data

I don’t want to pay for historical tick data; I’d prefer free historical tick data on all available assets.
– The question: why does no one ever share the data they paid for? Obviously I’m aware it’s because people pay a lot of money for such historical tick data.

That being said, I am willing to give up my gatekeeping, but only if this thread produces links to free historical tick data that was originally paid for (must be clean, for research purposes).

– I will publicly share a link that openly shares Level 3 OrderFlow going back to February 2026 (yes, Level 3 OrderFlow that is normally locked to institutions with deep pockets)
– this website houses 1min Options Data from polygon.io
– Historical economic calendar data for 20 countries, 125,368 events spanning 2015 to 2026. Includes event names, actual vs consensus vs previous values, and period information.
– Complete fundamental data for 984 companies. Each download includes the company profile, income statements, balance sheets, cash flows, key metrics, financial ratios, growth rates, earnings calendar, and insider trades.
– OHLCV candle data across 25,008 datasets: 4,168 symbols in 6 timeframes from 1 minute to daily. One CSV.gz file per year per timeframe per symbol.

but it does not have tick data.
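For anyone who grabs the OHLCV files, stitching the per-year CSV.gz archives back into one series is simple with the stdlib. A sketch; the column names are assumptions, so check the real header:

```python
import csv
import gzip
import io

def read_year(raw_gz: bytes):
    # Decode one gzipped CSV archive into a list of dict rows.
    with gzip.open(io.BytesIO(raw_gz), mode="rt", newline="") as f:
        return list(csv.DictReader(f))

def stitch_years(yearly_archives):
    # Concatenate yearly files (pass them in chronological order).
    rows = []
    for raw in yearly_archives:
        rows.extend(read_year(raw))
    return rows

def _gz(text: str) -> bytes:  # helper to fabricate demo archives in memory
    buf = io.BytesIO()
    with gzip.open(buf, mode="wt", newline="") as f:
        f.write(text)
    return buf.getvalue()

# Two stand-in yearly files; real column names may differ.
y2024 = _gz("ts,open,high,low,close,volume\n2024-01-02,10,11,9,10.5,1000\n")
y2025 = _gz("ts,open,high,low,close,volume\n2025-01-02,10.5,12,10,11.9,900\n")
rows = stitch_years([y2024, y2025])
```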

submitted by /u/liquidatedis
[link] [comments]

I Got Tired Of Checking Kaggle, HuggingFace, Data.gov, And Other Sites Every Time I Needed A Dataset, So I Built A Tool That Searches All Of Them At Once

Disclosure: I’m one of the creators of this tool.

Hi all,

I do ML research at Berkeley and the most tedious part of every project is dataset discovery. I’d spend hours opening tabs across Kaggle, HuggingFace, data.gov, Census, WHO, Semantic Scholar, and a dozen other platforms just to find the right data. Then I’d have to manually check licenses, preview columns, and figure out citations.

So my friend and I built Mobus, an open-source MCP server that lets you do all of that from inside Claude or Cursor. You describe what you need in natural language and it searches across 20 platforms, lets you preview the actual data, checks licenses, and generates citations.

It’s free and open source: https://github.com/mobus-ai/Mobus

Quick demo on the site if you want to see it in action: https://mobus.ai

Would love feedback from anyone who deals with this pain point. What data sources are missing that you’d want to see added?

submitted by /u/Swimming_Outside_988
[link] [comments]

Seeking A Dataset Of English Lemmas With Recognizability Scores

I checked out the word prevalence dataset of 62,000 lemmas. But it has some limitations:

  • It hasn’t been updated since 2019.

  • It misses modern terms like TikTok.

  • It doesn’t cover phrases.

I’ve scored about a million English entries from Wiktionary for recognizability. I built this for a pun tool. But I want to use the data for a new language project.

The dataset is too bloated because it’s full of inflected forms. Even if I set the recognizability threshold at 50 percent, I’m still looking at 100K words and 100K phrases. Going through a list that size is a waste of time. I need to filter the data through the English lemmas category from Wiktionary and split the single words from the multi-word phrases into separate lists.
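The filtering-and-splitting step can be sketched in a few lines. The data shapes are assumptions: `scores` maps an entry to a recognizability in [0, 1], and `lemmas` is the set of entries in Wiktionary’s English-lemmas category:

```python
def split_lemmas(scores, lemmas, threshold=0.5):
    # Keep only lemmas above the recognizability threshold, then split
    # single words from multi-word phrases.
    words, phrases = [], []
    for entry, score in scores.items():
        if entry not in lemmas or score < threshold:
            continue  # drop inflected forms and low-recognizability entries
        (phrases if " " in entry else words).append(entry)
    return sorted(words), sorted(phrases)

scores = {"run": 0.99, "running": 0.97, "kick the bucket": 0.8, "zymurgy": 0.1}
lemmas = {"run", "kick the bucket", "zymurgy"}  # "running" is an inflection
words, phrases = split_lemmas(scores, lemmas)
```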

Since the hard part of scoring is done, the rest should be easy peasy lemma squeezy. I just want to avoid reinventing the wheel if I can.

Before I spin up a separate repository to handle this, I’m checking if a similar dataset already exists. Has anyone seen a project that offers this?

submitted by /u/8ta4
[link] [comments]

Parallelogram – A Strict Linter For LLM Fine-tuning Datasets (catches Broken Data Before Your GPU Run Starts)

Fine-tuning frameworks assume your data is correctly formatted. None of them enforce it. The result is broken training runs discovered after the compute is spent.

Parallelogram is a CLI tool that validates fine-tuning datasets before any training starts. Strict hard-blocks on role sequence errors, empty turns, context window violations, duplicates, and mojibake. Exits 0 on clean data, exits 1 on errors — CI/CD friendly.

Apache 2.0, local-first, zero network calls.
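To illustrate the kind of hard-block this means (my own sketch, not Parallelogram’s actual code), a role-sequence and empty-turn check might look like:

```python
def lint(messages):
    # Hard-block checks: strict user/assistant alternation, no empty turns.
    errors = []
    expected = "user"
    for i, m in enumerate(messages):
        if not m.get("content", "").strip():
            errors.append(f"turn {i}: empty content")
        role = m.get("role")
        if role == "system" and i == 0:
            continue  # one leading system turn is allowed
        if role != expected:
            errors.append(f"turn {i}: expected role {expected!r}, got {role!r}")
        expected = "assistant" if expected == "user" else "user"
    return errors

clean = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
broken = [{"role": "assistant", "content": ""}]
# lint(clean) -> []; lint(broken) -> two errors, so the CLI would exit 1
```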

Looking for feedback on edge cases people have hit in real fine-tuning workflows. I’d love for you to try it out.

submitted by /u/Quiet-Nerd-5786
[link] [comments]

Built A CLI To Clean GitHub Repos Into LLM Training Data.. Is Manual Cleanup A Real Bottleneck For You?

Every time I try to use GitHub repos for LLM training, I lose hours cleaning junk like .git files, lock files, minified JS, generated code, binaries mixed with real source.

Public datasets like The Stack are great for general pretraining. But if you want a model on a specific stack or curated repos, you end up building the dataset (and the cleanup pipeline) yourself. So I built a CLI tool called RepoCurator to make that step reusable.

What it does:

– Clones repos (shallow for speed)

– Filters noise using rules

– Scores files (0.0–1.0) based on usefulness

– Exports clean, per-file datasets (JSON/TXT)
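For context on the scoring step, a heuristic in that 0.0–1.0 range might look like this (my guesswork, not RepoCurator’s actual rules):

```python
def score_file(path: str, text: str) -> float:
    # Score a file's usefulness as training data in [0.0, 1.0].
    score = 1.0
    if ".min." in path or path.endswith(".lock"):
        return 0.0  # minified/generated: useless for training
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    if avg_len > 200:
        score -= 0.5  # very long lines suggest generated code
    if not any(l.strip().startswith(("#", "//", "/*")) for l in lines):
        score -= 0.2  # no comments at all is a weak negative signal
    return max(score, 0.0)
```

Real rules would also look at file size, language detection, and dedup hashes; the point is just that each signal nudges a single score you can threshold on.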

Still early; I’m trying to validate whether this is a real problem. If it resonates, register interest on the page. It helps me decide whether to keep building.

Question:

How are you currently cleaning repos before using them for training or analysis?

submitted by /u/Ok_Rub3312
[link] [comments]

Prompt Generator/Text Generator For Image Generation

Hello fellow developers and analysts, I’m working on a project that will be using image generator models to generate thousands of images.

I have been tasked to find a text or prompt generator model or models to use with the image generators.

So for each image that is created, a different prompt needs to be used.

If I run these for 2 days to create images, the prompts also need to keep changing.
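One cheap way to guarantee varying prompts without a second model is template expansion. A sketch; the slot values are placeholders:

```python
import itertools
import random

# Hypothetical slot values -- replace with your own vocabulary.
SUBJECTS = ["a lighthouse", "a street market", "a mountain cabin"]
STYLES = ["watercolor", "photorealistic", "pixel art"]
MOODS = ["at dawn", "in heavy rain", "under neon light"]

def prompt_stream(seed=0):
    # Enumerate every subject/style/mood combination in shuffled order,
    # so no prompt ever repeats within one pass.
    combos = list(itertools.product(SUBJECTS, STYLES, MOODS))
    random.Random(seed).shuffle(combos)
    for subject, style, mood in combos:
        yield f"{style} rendering of {subject}, {mood}"

prompts = list(prompt_stream())
```

For more diversity you can additionally have a text LLM paraphrase each templated prompt, but the template layer alone already guarantees no repeats.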

If anybody has any suggestions or can point me in the right direction that would be great.

We will be adding the models to our instance and using them from there.

Any help would be appreciated

submitted by /u/Junior_Wheel1690
[link] [comments]

Working On Real-time Data From Brands, And Social Media

Working on browser-based agents that can fetch real-time digital content like posts, images, brand details, and videos from social media and company websites using natural language queries, and turn it into structured data you can directly use.

The goal is to plug this into digital marketing workflows for things like trend tracking, content inspiration, competitor monitoring, and campaign research without manual browsing or scraping. Is this something people would be interested in?

submitted by /u/agentbrowser091
[link] [comments]

Natural Disasters Normalized For Cross Domain Comparisons

I’ve been building a program for the past couple months and it’s in good shape to share now.

The meat of it is earthquakes, volcanoes, tsunamis, hurricanes, tornadoes, currencies, the CIA World Factbook, and the UN SDGs (plenty more coming). I’ve got all these datasets normalized to a loc-id system, so you can query across data really easily, and I’ve opened up the API lanes and made MCP tools. Some are paid datasets; I’m using x402 for a few. Plenty are free though, so check it out!
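As a sketch of what a shared loc-id buys you (my illustration of the idea, not the site’s actual schema), cross-domain questions reduce to key-joins:

```python
# Stand-in tables keyed by a (country, region) loc-id.
quakes = {("US", "CA"): {"quakes_5plus": 12}}
sdg = {("US", "CA"): {"sdg13_score": 0.71}, ("FR", "75"): {"sdg13_score": 0.84}}

def join_on_loc(*tables):
    # Merge any number of loc-id-keyed tables into one record per location.
    out = {}
    for table in tables:
        for loc, fields in table.items():
            out.setdefault(loc, {}).update(fields)
    return out

merged = join_on_loc(quakes, sdg)
```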

www.daedalmap.com/agents

There’s the human-side app as well; you can explore there to see what it’s like. I’ve been building a research mode that lets users take a bounded set of data and ask questions of it.

submitted by /u/Xyver
[link] [comments]

Searching For A Tool To Generate A Dataset

Hi everyone,

I’m working on an anomaly detection project using logs from an all-in-one OpenStack deployment (Ansible-based). The logs come from multiple sources and are collected via Fluentd and sent to OpenSearch.

My main problem is that I don’t have a dataset, and I don’t have enough time to build one manually.

I’m considering running OpenStack for a full day to generate a large amount of logs, then using a tool to generate more data from them so I end up with a large, high-quality dataset for anomaly detection.

Are there any tools or approaches that can help generate a good dataset from my own logs in this kind of setup? (The logs are JSON lines!)
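One naive approach, as a sketch: resample the real JSON-line records and jitter the timestamps to grow the corpus. The field name `timestamp` and the ISO format are assumptions; match them to your Fluentd output:

```python
import json
import random
from datetime import datetime, timedelta

def augment(lines, factor=3, seed=0):
    # Clone each real record `factor` times with jittered timestamps.
    rng = random.Random(seed)
    records = [json.loads(l) for l in lines]
    out = []
    for _ in range(factor):
        for rec in records:
            clone = dict(rec)
            ts = datetime.fromisoformat(clone["timestamp"])
            clone["timestamp"] = (ts + timedelta(seconds=rng.uniform(-30, 30))).isoformat()
            out.append(json.dumps(clone))
    return out

real = ['{"timestamp": "2025-01-01T12:00:00", "level": "INFO", "msg": "boot"}']
out = augment(real)
```

Note this only multiplies surface patterns; for anomaly detection you’d still need to inject labeled anomalies (error bursts, missing services) separately.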

Thanks in advance!

submitted by /u/Substantial_Elk_2999
[link] [comments]

[Disclaimer – My Personal Project] Built This Advanced But Extremely Beginner Friendly Data Visualisation Tool. Please Share Your Thoughts

Hey everyone

I’m thrilled to share Polyform — the modern way to analyse and visualise data without the usual headaches.

Tired of juggling spreadsheets for editing and separate tools for charting? Polyform lets you edit data just like a familiar spreadsheet, while instantly visualising it across 24+ beautiful chart types at the same time — bar, line, pie, scatter, radar, heatmap, candlestick, waterfall, gauge, 3D surface, and many more.

Key highlights:

Change any value and watch your charts animate instantly — no refresh, no lag.

Connect multiple data sheets (e.g., sales + regions) and create combined visuals in one chart.

Sign in and start working immediately. Everything lives in the cloud.

Generate a shareable link — teammates can view or edit without signing up.

Export charts as PNG/JPG/PDF, data as CSV/Excel, or full dashboards.

Add rows/columns on the fly, custom color palettes, link locking for safety, and financial/KPI charts built-in.

Whether you’re a solo analyst spotting trends or a growing team needing fast insights, Polyform scales with you. From raw data to shareable, insightful dashboards in under a minute.

No plugins. No complex setup. Just powerful, real-time data storytelling.

Try it here: https://polyform-graphs.lovable.app

Would love your feedback — what’s the one chart type or workflow you wish existed in your current tools? What’s in here that can be improved?

submitted by /u/FOR_REAL_NOT_REAL
[link] [comments]

Where Do You Look For Reliable Datasets That Aren’t Behind Paywalls?

Finding datasets isn’t that hard, but finding ones that are actually reliable, well-documented, and usable (without a paywall) is a different story.

Obviously there are government portals, the World Bank, etc., but even they’re pretty hit or miss depending on data structure and maintenance.

Where do you consistently go when you need solid datasets? Not just a big list of datasets, but sources you actually trust for documentation, clear definitions and methodology, and reasonably up-to-date data: something you’d feel comfortable citing or building on?

Please drop links too if you can; I’m always looking to build a better mental list of go-to sources.

submitted by /u/Rude_Context_4844
[link] [comments]

[PAID] Built A Real-time Salary Dataset From Fortune 500 Workday Job Postings — 100% US Salary Coverage Because Of Pay Transparency Laws. Free Sample Available. [Disclosure: Our Product]

my co-founder and i have been building this for a few months and wanted to share here.

150K-300K active job postings refreshed weekly, 100% US salary coverage, 22 structured fields including salary_min, salary_max, job_category, remote_type, worker_type, requirements, and posted_date. companies include NVIDIA, Goldman Sachs, Walmart, Target, Disney, Pfizer, Boeing, Deloitte and 1,200+ others.

CSV or JSON, ready for R, Stata, or Python out of the box.
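For a quick sense of how the fields compose (field names from the post; the sample rows are invented for the demo):

```python
import csv
import io

# Stand-in CSV using the documented fields; values are made up.
sample = io.StringIO(
    "company,job_category,salary_min,salary_max,remote_type\n"
    "NVIDIA,Engineering,150000,220000,hybrid\n"
    "Walmart,Retail,35000,52000,onsite\n"
)
rows = list(csv.DictReader(sample))

# Example derived metric: posted salary midpoint per company.
midpoints = {r["company"]: (int(r["salary_min"]) + int(r["salary_max"])) / 2 for r in rows}
```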

been getting interest from labor economists studying pay transparency laws and HR analytics teams — figured researchers here might find it useful too.

this dataset isn’t on our site yet — submit a custom data request at datapulse.skop.dev/custom-request and we’ll get back to you with a free sample within a few hours.

what fields are we missing?

submitted by /u/Sufficient-War-4020
[link] [comments]

Seeking IMDb Gendered Ratings (Raw Scores) Post-2018 For A Data Viz Project

I’m building a site that visualizes gender differences and similarities in movie ratings (screenshots: https://imgur.com/a/yEM5wUd). Currently I’m using a 2018 IMDb list of the top 200 movies rated by women, but it’s outdated and likely misses many highly men-favored films that didn’t make that specific list.

While IMDb displayed gendered ratings until early 2023, their official TSV datasets only provide the aggregate averageRating. I need the specific Male vs. Female raw ratings, not just a gendered rank.

Does anyone know of a dataset, archive, or scraper output from 2019–2023 that captured the demographics breakdown before the UI changes? I’ve checked the standard IMDb non-commercial sets, but the granularity isn’t there.
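One avenue worth trying: the Wayback Machine’s CDX API can list 2019–2023 captures of a title’s /ratings page, which is where the gendered breakdown lived before the UI change. A sketch that only builds the query URL; fetching it and scraping the returned snapshots is the remaining work:

```python
from urllib.parse import urlencode

def cdx_url(imdb_id, start="2019", end="2023"):
    # Build a Wayback CDX query listing archived captures of the
    # ratings page for one title in the given year range.
    params = urlencode({
        "url": f"imdb.com/title/{imdb_id}/ratings",
        "from": start,
        "to": end,
        "output": "json",
        "limit": "50",
    })
    return f"https://web.archive.org/cdx/search/cdx?{params}"

url = cdx_url("tt0111161")  # example title ID
```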

Thanks!

submitted by /u/HandToDirt
[link] [comments]

Nobody Asked For It, But I Still Built It.

As you can tell from the title and the tags, this is an NSFW manga dataset, with over 500k entries covering manga ID, title, release date, and all the other metadata.

I haven’t updated it since March of this year. No need to worry, though; I promise to update it more frequently. And the favorites count may vary between when a manga was posted and when it was scraped.

Feel free to use it in your personal data science projects. And tag me if you make something hilarious.

submitted by /u/banana_737
[link] [comments]

[Self-Promotion][Custom Dataset Infrastructure] Where Public Datasets Keep Falling Short For Production AI Systems

Over the past few months, we’ve been helping teams source highly specific datasets that public benchmarks consistently miss.

Some examples:

– Off-script voice agent conversations (interruptions, objections, mixed intent)

– Real human SaaS workflow screen recordings

– Industrial OCR edge cases (reflective packaging, degraded print)

– Computer vision long-tail failures (low-light, oblique angles, occlusion)

– Agent workflow regression scenarios (schema drift, retries, stale state)

Biggest takeaway:

For most production AI systems, the bottleneck usually isn’t the model.

It’s dataset coverage around messy real-world deployment conditions.

Public datasets are usually enough for demos.

Custom datasets are what close the gap to production reliability.

The more specialized the deployment environment becomes, the more valuable targeted data infrastructure becomes.

If you’re actively running into dataset gaps that public benchmarks aren’t solving, feel free to DM me with what you need; always happy to compare notes or help scope solutions.

submitted by /u/Khade_G
[link] [comments]