Built this because I wanted a reproducible way to create fine-tuning datasets without doing it all by hand.
You give it seed prompts or an existing dataset; it generates instruction-output pairs via any OpenRouter model, scores them with a local or remote LLM judge, and exports clean JSONL you can use directly for training.
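To make the generation step concrete, here is a minimal sketch of how a call to OpenRouter's OpenAI-compatible chat completions endpoint might produce one instruction-output pair. This is not the project's actual code: the prompt wording, JSON schema, and default model id are my assumptions for illustration.

```python
import json
import os
import urllib.request

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_messages(seed_prompt: str) -> list:
    """Chat messages asking the model for one instruction-output pair.
    The exact prompt wording here is illustrative, not the tool's own."""
    return [
        {"role": "system",
         "content": ("You write training data. Reply with a JSON object "
                     "containing 'instruction' and 'output' keys.")},
        {"role": "user", "content": f"Topic: {seed_prompt}"},
    ]

def generate_pair(seed_prompt, model="meta-llama/llama-3.1-8b-instruct"):
    """One generation call (requires network and OPENROUTER_API_KEY)."""
    body = json.dumps({"model": model,
                       "messages": build_messages(seed_prompt)}).encode()
    req = urllib.request.Request(
        OPENROUTER_URL, data=body,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # Parse the model's JSON reply into a {"instruction": ..., "output": ...} dict.
    return json.loads(reply["choices"][0]["message"]["content"])
```

Any OpenRouter model id can be swapped in for the default shown here, since the endpoint and request shape stay the same across models.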
You can also ingest datasets straight from HuggingFace and filter or relabel them through the same pipeline.
The export step lets you set a score threshold and a train/val split ratio so what comes out is ready to use.
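The threshold-and-split export described above can be sketched in a few lines. The field names (`score`, `instruction`, `output`) and defaults are assumptions about the tool's schema, not confirmed by the post.

```python
import json
import random

def filter_and_split(records, score_threshold=7.0, val_ratio=0.1, seed=42):
    """Keep records at or above the judge-score threshold, then shuffle
    deterministically and split into (train, val) lists."""
    kept = [r for r in records if r.get("score", 0) >= score_threshold]
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_ratio)
    return kept[n_val:], kept[:n_val]  # train, val

def write_jsonl(path, rows):
    """Write one JSON object per line, ready for a training pipeline."""
    with open(path, "w") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

Usage would look like `train, val = filter_and_split(records, 8.0, 0.1)` followed by `write_jsonl("train.jsonl", train)` and `write_jsonl("val.jsonl", val)`.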
MIT licensed. Everything is stored locally, and no data leaves your machine unless you choose a cloud judge backend.
GitHub project link is in the comments below 👇
submitted by /u/gvij