Best way to clean GitHub data (remove node_modules, lockfiles, etc.) for LLM fine-tuning?

Anyone else wasting hours cleaning GitHub data for LLM fine-tuning?

I tried building my own dataset (instead of relying on Hugging Face), but scraping repos is messy: node_modules, lockfiles, minified code, binaries… tons of junk.

Feels like more time goes into cleaning than actual training.
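To make the "junk" concrete, here's a minimal sketch of the kind of filtering I mean (the directory and extension skip lists and the size cap are illustrative guesses, not an exhaustive or definitive list):

```python
# Sketch: walk a cloned repo and keep only files that look like real source.
# Skip vendored dirs, lockfiles, minified assets, and binary blobs.
from pathlib import Path

SKIP_DIRS = {"node_modules", ".git", "dist", "build", "vendor", "__pycache__"}
SKIP_FILES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml", "poetry.lock", "Cargo.lock"}
SKIP_SUFFIXES = {".min.js", ".min.css", ".map", ".lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".ico", ".pdf", ".zip", ".gz",
                   ".so", ".dll", ".exe", ".bin", ".woff", ".woff2"}
MAX_BYTES = 1_000_000  # arbitrary cap; huge files are usually generated

def looks_binary(path: Path) -> bool:
    """Cheap binary check: a NUL byte in the first KB."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(1024)

def keep(path: Path) -> bool:
    """True if the file is worth keeping for a training corpus."""
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    if path.name in SKIP_FILES:
        return False
    if any(path.name.endswith(s) for s in SKIP_SUFFIXES | BINARY_SUFFIXES):
        return False
    if path.stat().st_size > MAX_BYTES:
        return False
    return not looks_binary(path)

def iter_clean_files(repo_root: str):
    """Yield paths of files that pass the junk filter."""
    for p in Path(repo_root).rglob("*"):
        if p.is_file() and keep(p):
            yield p

if __name__ == "__main__":
    for f in iter_clean_files("some-cloned-repo"):  # hypothetical local clone
        print(f)
```

That catches the obvious stuff, but it still misses generated code, giant test fixtures, license boilerplate, etc.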

Curious how you’re handling this:

custom scripts?

existing tools?

or just manual cleanup?

Also, how are you structuring the data for different LLM fine-tuning formats?
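What I had in mind is something like the sketch below: dump a format-agnostic JSONL first (metadata plus raw text), then remap it into whatever shape a given trainer expects. The field names and the chat-style target are just my own conventions, not any standard:

```python
# Sketch: one JSON object per kept file, then a remap into a chat-style layout.
import json
from pathlib import Path

def write_raw_jsonl(repo_name: str, files, out_path: str = "raw.jsonl"):
    """Base dataset: repo/path metadata plus raw text, format-agnostic."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            p = Path(path)
            record = {
                "repo": repo_name,
                "path": str(p),
                "language": p.suffix.lstrip("."),
                "text": p.read_text(encoding="utf-8", errors="ignore"),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")

def raw_to_chat(raw_path: str = "raw.jsonl", out_path: str = "chat.jsonl"):
    """Remap base records into a chat-style JSONL (one possible target format)."""
    with open(raw_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            r = json.loads(line)
            chat = {
                "messages": [
                    {"role": "user", "content": f"Write {r['path']} for the {r['repo']} project."},
                    {"role": "assistant", "content": r["text"]},
                ]
            }
            out.write(json.dumps(chat, ensure_ascii=False) + "\n")
```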

Thinking about building something to automate this if it's a common problem.

Would love to hear the workflows you're all using.

