Every time I try to use GitHub repos for LLM training, I lose hours cleaning junk: .git files, lock files, minified JS, generated code, and binaries mixed in with the real source.
Public datasets like The Stack are great for general pretraining, but if you want a model trained on a specific stack or a curated set of repos, you end up building the dataset (and the cleanup pipeline) yourself. So I built a CLI tool called RepoCurator to make that step reusable.
What it does:
– Clones repos (shallow for speed)
– Filters noise using rules
– Scores files (0.0–1.0) based on usefulness
– Exports clean, per-file datasets (JSON/TXT)
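To make the filter → score → export steps concrete, here is a minimal sketch of what that pipeline might look like. The noise rules, the scoring heuristic (penalize long average line length as a minified/generated-code signal, reward comment density), and all function names are my own illustrative assumptions, not RepoCurator's actual implementation.

```python
import json
from pathlib import Path

# Hypothetical noise rules, approximating what such a tool might ship with.
NOISE_DIRS = {".git", "node_modules", "dist", "build", "__pycache__"}
NOISE_SUFFIXES = (".lock", ".min.js", ".map", ".png", ".jpg", ".so", ".bin")

def is_noise(path: Path) -> bool:
    """Drop files inside junk directories or with junk suffixes."""
    if any(part in NOISE_DIRS for part in path.parts):
        return True
    return path.name.endswith(NOISE_SUFFIXES)

def score(text: str) -> float:
    """Crude 0.0-1.0 usefulness score: very long average lines suggest
    minified/generated code; comments suggest human-written source."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    long_penalty = min(avg_len / 200.0, 1.0)
    comments = sum(1 for l in lines if l.lstrip().startswith(("#", "//")))
    comment_bonus = min(comments / len(lines) * 2, 0.3)
    return max(0.0, min(1.0, 1.0 - long_penalty + comment_bonus))

def export(repo_dir: str, out_path: str, min_score: float = 0.3) -> int:
    """Walk a (already cloned) repo, keep files above min_score,
    and write one JSON record per file. Returns the record count."""
    records = []
    for p in sorted(Path(repo_dir).rglob("*")):
        if not p.is_file() or is_noise(p):
            continue
        try:
            text = p.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        s = score(text)
        if s >= min_score:
            records.append({"path": str(p.relative_to(repo_dir)),
                            "score": round(s, 2), "text": text})
    Path(out_path).write_text(json.dumps(records, indent=2))
    return len(records)
```

The shallow-clone step before this would just be `git clone --depth 1 <url>`; everything after that is local file walking, so the expensive part is only done once per repo.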
It's still early, and I'm trying to validate whether this is a real problem. If it resonates, register interest on the page; that helps me decide whether to keep building.
Question:
How are you currently cleaning repos before using them for training or analysis?
submitted by /u/Ok_Rub3312