I am fairly new to ML and I've been wanting to fine-tune a model (T5-base/large) on my own dataset. There are a few problems I've been encountering:
- Writing a script to scrape different websites, but the results come with a lot of noise.
- I need to write a different script for each website (see the extraction sketch after this list).
- Some of the scraped data can be wrong or incomplete.
- I've tried manually checking a few thousand samples and concluded I shouldn't have wasted my time in the first place.
- Sometimes the script works, but a different HTML format on the same website introduced noise into my samples that I wouldn't have noticed unless I manually went through all of them.
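For the per-site-script problem, a generic boilerplate-removal library can often replace most hand-written scrapers. A minimal sketch of what that could look like, using trafilatura (the URLs here are just placeholders; readability-lxml or newspaper3k are similar alternatives):

```python
# Minimal sketch, assuming trafilatura is installed (pip install trafilatura).
# fetch_url() downloads the page; extract() strips nav bars, ads, and other
# boilerplate and returns the main text. The URLs below are placeholders.
import trafilatura

urls = [
    "https://example.com/article-1",
    "https://example.com/article-2",
]

for url in urls:
    html = trafilatura.fetch_url(url)   # returns None on download failure
    if html is None:
        continue
    text = trafilatura.extract(html)    # returns None if no main content found
    if text:
        print(text[:200])
```

The same call works across different sites, so one script can stand in for most per-site ones; the output still needs filtering, but the markup noise should drop considerably.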
Solutions I've tried:
1. Using ChatGPT to generate samples. (The generated samples aren't good enough for fine-tuning, and most of them are repetitive; see the dedup sketch after this list.)
2. Manually adding samples. (Takes forever; I don't know why I even tried this, it should've been obvious, but I was desperate.)
3. Writing a mini script to scrape each source. (Works to an extent, but I have to keep writing new scripts and the scraped data is still noisy.)
4. Using regex to clean the data, but some of it is too noisy and random to clean properly. (It mostly works, but about 20-30% of the data is still extremely noisy and I'm not sure how to clean it; see the filtering sketch after this list.)
5. Looking on Hugging Face and other websites, but I couldn't find exactly the data I'm looking for, and even what I did find was insufficient. (To be fair, I also wanted to collect data on my own to see how it works.)
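On the regex-cleaning problem: cheap heuristic filters (minimum length, symbol ratio, known residue phrases), roughly in the spirit of C4-style cleaning, tend to catch a lot of what individual regexes miss. A minimal sketch; the thresholds are guesses to tune against your own data:

```python
import re

# Hypothetical thresholds; tune them by eyeballing what gets dropped.
MIN_WORDS = 10
MAX_SYMBOL_RATIO = 0.3   # fraction of non-alphanumeric, non-space characters

def is_clean(sample: str) -> bool:
    """Cheap heuristic filters for scraped text."""
    if len(sample.split()) < MIN_WORDS:
        return False
    # Drop samples that are mostly markup/symbols (leftover HTML, JS, etc.)
    symbols = sum(1 for ch in sample if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(sample), 1) > MAX_SYMBOL_RATIO:
        return False
    # Drop obvious scraper residue
    if re.search(r"(?i)(javascript is disabled|cookie policy|lorem ipsum)", sample):
        return False
    return True

samples = [
    "A real paragraph of article text that is long enough to keep around.",
    "{ nav: true } <div> accept our cookie policy </div>",  # residue, dropped
]
print([s for s in samples if is_clean(s)])
```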
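And for the repetitive ChatGPT samples, near-duplicate removal helps more than exact dedup, since generations tend to repeat with small wording changes. A sketch using MinHash LSH from the datasketch package (pip install datasketch); the 0.8 threshold is an assumption to tune:

```python
import re
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash the set of word tokens so similar texts get similar signatures."""
    m = MinHash(num_perm=num_perm)
    for token in set(re.findall(r"\w+", text.lower())):
        m.update(token.encode("utf8"))
    return m

samples = [
    "The cat sat on the mat.",
    "The cat sat on the mat!",  # near-duplicate, should be dropped
    "A completely different sentence about fine-tuning T5.",
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # flags ~80%+ Jaccard similarity
unique = []
for i, text in enumerate(samples):
    m = minhash(text)
    if lsh.query(m):   # an already-kept sample is too similar, skip this one
        continue
    lsh.insert(str(i), m)
    unique.append(text)

print(unique)   # the exclamation-mark variant is filtered out
```

Running this over the whole corpus once before fine-tuning should thin out the repetitive generations without touching genuinely distinct samples.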
So, my question is: is there an easier way to get clean data? What kinds of crawlers/scripts can I use to automate this process? More precisely, I want to know the go-to solutions/techniques people use to collect training data.