Anyone else wasting hours cleaning GitHub data for LLM fine-tuning?
I tried building my own dataset (instead of relying on Hugging Face), but scraping repos is messy: node_modules, lockfiles, minified code, binaries… tons of junk.
Feels like more time goes into cleaning than actual training.
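To make it concrete, here's a rough sketch of the kind of filter script I keep ending up with. The directory names, extension lists, and the `scraped_repo` path are just examples of common junk, not a real pipeline:

```python
# Rough sketch of a repo-cleanup filter. The skip lists and the
# "scraped_repo" path are placeholder examples, not a definitive setup.
from pathlib import Path

SKIP_DIRS = {"node_modules", ".git", "dist", "build", "vendor"}
SKIP_FILES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Cargo.lock"}
BINARY_EXTS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".ico", ".woff", ".woff2"}

def looks_binary(path: Path, sniff_bytes: int = 1024) -> bool:
    """Heuristic: treat files with a NUL byte in the first 1 KB as binary."""
    try:
        with path.open("rb") as f:
            return b"\x00" in f.read(sniff_bytes)
    except OSError:
        return True  # unreadable -> drop it

def keep(path: Path) -> bool:
    if any(part in SKIP_DIRS for part in path.parts):
        return False  # vendored deps, build output, git internals
    if path.name in SKIP_FILES:
        return False  # lockfiles
    if path.suffix.lower() in BINARY_EXTS:
        return False  # obvious binaries by extension
    if path.name.endswith((".min.js", ".min.css")):
        return False  # minified code
    return not looks_binary(path)

repo_root = Path("scraped_repo")  # hypothetical scrape output dir
clean_files = [p for p in repo_root.rglob("*") if p.is_file() and keep(p)]
print(f"kept {len(clean_files)} files")
```

Even with all that, stuff like generated code and huge data fixtures still sneaks through.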
Curious how you’re handling this:
custom scripts?
existing tools?
or just manual cleanup?
Also, how are you structuring the data for the different LLM fine-tuning formats?
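For the format question, this is the kind of thing I mean: flattening the cleaned files into chat-style JSONL. The `messages` schema here mirrors a common chat fine-tuning convention, and the `cleaned_repo` path is a placeholder; every trainer seems to want slightly different fields:

```python
# Sketch: wrap each cleaned source file as one chat-style JSONL record.
# The "messages" schema and "cleaned_repo" path are assumptions for
# illustration; adjust to whatever your trainer actually expects.
import json
from pathlib import Path

def to_chat_record(path: Path) -> dict:
    """Turn one source file into a user/assistant training example."""
    code = path.read_text(encoding="utf-8", errors="ignore")
    return {
        "messages": [
            {"role": "user", "content": f"Write {path.name}"},
            {"role": "assistant", "content": code},
        ]
    }

cleaned = Path("cleaned_repo")  # hypothetical output of the filter step
with open("train.jsonl", "w", encoding="utf-8") as out:
    for p in sorted(cleaned.rglob("*")):
        if p.is_file():
            out.write(json.dumps(to_chat_record(p)) + "\n")
```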
Thinking about building something to automate this if it's a common problem.
Would love to hear the workflows you all use.