{"id":39986,"date":"2026-03-27T14:31:13","date_gmt":"2026-03-27T13:31:13","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/built-a-dataset-generation-skill-after-spending-way-too-much-on-openai-claude-and-gemini-apis\/"},"modified":"2026-03-27T14:31:13","modified_gmt":"2026-03-27T13:31:13","slug":"built-a-dataset-generation-skill-after-spending-way-too-much-on-openai-claude-and-gemini-apis","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/built-a-dataset-generation-skill-after-spending-way-too-much-on-openai-claude-and-gemini-apis\/","title":{"rendered":"Built A Dataset Generation Skill After Spending Way Too Much On OpenAI, Claude, And Gemini APIs"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>Hey \ud83d\udc4b<\/p>\n<p>Quick project showcase. I built a dataset generation skill for Claude, Codex, and Antigravity after spending way too much on the OpenAI, Claude, and Gemini APIs.<\/p>\n<p>At first I was using APIs for the whole workflow. That worked, but it got expensive really fast once the work stopped being just &#8220;generate examples&#8221; and became:<br \/> generate -&gt; inspect -&gt; dedup -&gt; rebalance -&gt; verify -&gt; audit -&gt; re-export -&gt; repeat<\/p>\n<p>So I moved the workflow into a skill and pushed as much as possible into a deterministic local pipeline.<\/p>\n<p>The useful part is that it is not just a synthetic dataset generator.<br \/> You can ask it to:<br \/> &#8220;generate a medical triage dataset&#8221;<br \/> &#8220;turn these URLs into a training dataset&#8221;<br \/> &#8220;use web research to build a fintech FAQ dataset&#8221;<br \/> &#8220;normalize this CSV into OpenAI JSONL&#8221;<br \/> &#8220;audit this dataset and tell me what is wrong with it&#8221;<\/p>\n<p>It can generate from a topic, research the topic first, collect from URLs, collect from local files\/repos, or normalize an existing dataset into one canonical pipeline.<\/p>\n<p>How it works:<br \/> The agent handles planning and reasoning.<br \/> The local pipeline handles normalization, verification, generation-time dedup, coverage steering, semantic review hooks, export, and auditing.<\/p>\n<p>What it does:<br \/> &#8211; Research-first dataset building instead of pure synthetic generation<br \/> &#8211; Canonical normalization into one internal schema<br \/> &#8211; Generation-time dedup so duplicates get rejected during the build<br \/> &#8211; Coverage checks while generating so the next batch targets missing buckets<br \/> &#8211; Semantic review via review files, not just regex-style heuristics<br \/> &#8211; Corpus audits for split leakage, context leakage, taxonomy balance, and synthetic fingerprints<br \/> &#8211; Export to OpenAI, HuggingFace, CSV, or flat JSONL<br \/> &#8211; Prompt sanitization on export so training-facing fields are safer by default while metadata stays available for analysis<\/p>\n<p>How it is built under the hood: <\/p>\n<p>SKILL.md (orchestrator)<br \/> \u251c\u2500\u2500 12 sub-skills (dataset-strategy, seed-generator, local-collector, llm-judge, dataset-auditor, &#8230;)<br \/> \u251c\u2500\u2500 8 pipeline scripts (generate.py, build_loop.py, verify.py, dedup.py, export.py, &#8230;)<br \/> \u251c\u2500\u2500 9 utility modules (canonical.py, visibility.py, coverage_plan.py, db.py, &#8230;)<br \/> \u251c\u2500\u2500 1 internal canonical schema<br \/> \u251c\u2500\u2500 3 export presets<br \/> \u2514\u2500\u2500 50 automated tests<\/p>\n<p>The reason I built it this way is 
How it is built under the hood:

SKILL.md (orchestrator)
├── 12 sub-skills (dataset-strategy, seed-generator, local-collector, llm-judge, dataset-auditor, …)
├── 8 pipeline scripts (generate.py, build_loop.py, verify.py, dedup.py, export.py, …)
├── 9 utility modules (canonical.py, visibility.py, coverage_plan.py, db.py, …)
├── 1 internal canonical schema
├── 3 export presets
└── 50 automated tests
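For a rough idea of what "one internal canonical schema" plus sanitize-on-export can look like, here is a sketch under the same caveat: the field names and the redaction pattern below are assumptions for illustration, not what canonical.py or export.py actually contain.

```python
# Illustrative guess at a canonical schema plus sanitize-on-export;
# field names and the redaction regex are assumptions, not what
# canonical.py / export.py actually define.
import json
import re
from dataclasses import dataclass, field

@dataclass
class CanonicalExample:
    prompt: str
    response: str
    bucket: str = "general"                    # taxonomy label for coverage/audits
    meta: dict = field(default_factory=dict)   # provenance; kept for analysis only

SECRETS = re.compile(r"sk-[A-Za-z0-9]{20,}|\b\d{16}\b")  # e.g. API keys, card numbers

def sanitize(text: str) -> str:
    # Scrub training-facing fields; meta is deliberately left alone.
    return SECRETS.sub("[REDACTED]", text)

def export_openai_jsonl(examples, path):
    # OpenAI chat fine-tuning format: one {"messages": [...]} object per line.
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            record = {"messages": [
                {"role": "user", "content": sanitize(ex.prompt)},
                {"role": "assistant", "content": sanitize(ex.response)},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The point of the split is that `meta` keeps provenance available for audits while the exported training file only ever sees sanitized prompt/response text.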
The reason I built it this way is cost. I did not want to keep paying API prices for orchestration, cleanup, validation, and export logic that can be done locally.

The second reason is control. I wanted a workflow where I can inspect the data, keep metadata, audit the corpus, and still export a safer training artifact when needed.

It started as a way to stop burning money on dataset iteration, but it ended up becoming a much cleaner dataset engineering workflow overall.

If people want to try it:

```
git clone https://github.com/Bhanunamikaze/AI-Dataset-Generator.git
cd AI-Dataset-Generator
./install.sh --target all --force
```

or you can simply run:

```
curl -sSL https://raw.githubusercontent.com/Bhanunamikaze/ai-dataset-generator/main/install.sh | bash -s -- --online --target all
```

Then restart the IDE session and ask it to build or audit a dataset.

Repo:
https://github.com/Bhanunamikaze/AI-Dataset-Generator

If anyone here is building fine-tuning or eval datasets, I would genuinely love feedback on the workflow.
⭐ Star it if the skill pattern feels useful
🐛 Open an issue if you find something broken
🔀 PRs are very welcome