Every time I try to use GitHub repos for LLM training, I lose hours cleaning junk: .git files, lock files, minified JS, generated code, and binaries mixed in with the real source.
Public datasets like The Stack are great for general pretraining, but if you want a model trained on a specific stack or a curated set of repos, you end up building the dataset (and the cleanup pipeline) yourself. So I built a CLI tool called RepoCurator to make that step reusable.
What it does:
– Clones repos (shallow for speed)
– Filters noise using rules
– Scores files (0.0–1.0) based on usefulness
– Exports clean, per-file datasets (JSON/TXT)
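To make the filter → score → export steps concrete, here is a minimal sketch of what that pipeline might look like. The noise rules, the scoring heuristic (penalize long average line length as a minified/generated-code signal, reward comment density), and all function names are my own illustrative assumptions, not RepoCurator's actual implementation.

```python
import json
from pathlib import Path

# Hypothetical noise rules, approximating what such a tool might ship with.
NOISE_DIRS = {".git", "node_modules", "dist", "build", "__pycache__"}
NOISE_SUFFIXES = (".lock", ".min.js", ".map", ".png", ".jpg", ".so", ".bin")

def is_noise(path: Path) -> bool:
    """Drop files inside junk directories or with junk suffixes."""
    if any(part in NOISE_DIRS for part in path.parts):
        return True
    return path.name.endswith(NOISE_SUFFIXES)

def score(text: str) -> float:
    """Crude 0.0-1.0 usefulness score: very long average lines suggest
    minified/generated code; comments suggest human-written source."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    long_penalty = min(avg_len / 200.0, 1.0)
    comments = sum(1 for l in lines if l.lstrip().startswith(("#", "//")))
    comment_bonus = min(comments / len(lines) * 2, 0.3)
    return max(0.0, min(1.0, 1.0 - long_penalty + comment_bonus))

def export(repo_dir: str, out_path: str, min_score: float = 0.3) -> int:
    """Walk a (already cloned) repo, keep files above min_score,
    and write one JSON record per file. Returns the record count."""
    records = []
    for p in sorted(Path(repo_dir).rglob("*")):
        if not p.is_file() or is_noise(p):
            continue
        try:
            text = p.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        s = score(text)
        if s >= min_score:
            records.append({"path": str(p.relative_to(repo_dir)),
                            "score": round(s, 2), "text": text})
    Path(out_path).write_text(json.dumps(records, indent=2))
    return len(records)
```

The shallow-clone step before this would just be `git clone --depth 1 <url>`; everything after that is local file walking, so the expensive part is only done once per repo.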
It's still early, and I'm trying to validate whether this is a real problem. If it resonates, register interest on the page; that helps me decide whether to keep building.
Question:
How are you currently cleaning repos before using them for training or analysis?
submitted by /u/Ok_Rub3312