How Do People Collect Data Using Crawlers For Fine Tuning?

I am fairly new to ML and I've been wanting to fine-tune a model (T5-base/large) on my own dataset. There are a few problems I've been encountering:

1. Writing a script to scrape different websites, but the output comes with a lot of noise.
2. I need to write a different script for each website.
3. Some of the scraped data is wrong or incomplete.
4. I tried manually checking a few thousand samples and concluded it was a waste of time.
5. Sometimes the script works, but a different HTML format on the same website introduces noise into my samples that I would not notice unless I manually went through every sample.

Solutions I've tried:

1. Using ChatGPT to generate samples. (The generated samples are not good enough for fine-tuning, and most of them are repetitive.)
2. Manually adding samples (takes forever; in hindsight this was obviously impractical, but I was desperate).
3. Writing a mini script to scrape each source (works to an extent, but I have to keep writing new scripts, and the scraped data is still noisy).
4. Using regex to clean the data, but some of it is too noisy and random to clean properly (it works, but about 20-30% of the data is still extremely noisy, and I'm not sure how to clean it).
5. Looking on Hugging Face and other websites, but I couldn't find exactly the data I'm looking for, and what I did find was insufficient. (To be fair, I also wanted to collect data on my own to see how it works.)

So, my question is: is there a way to get clean data more easily? What kinds of crawlers/scripts can I use to help automate this process?
More precisely, I want to know the go-to solutions/techniques used to collect data.

submitted by [/u/Loud-Dream-975](https://www.reddit.com/user/Loud-Dream-975) — [link](https://www.reddit.com/r/datasets/comments/1mcsfgq/how_do_people_collect_data_using_crawlers_for/) [comments](https://www.reddit.com/r/datasets/comments/1mcsfgq/how_do_people_collect_data_using_crawlers_for/)
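One common answer to the per-site-script problem is a generic content extractor: parse the HTML once and keep only text from content-bearing tags while skipping boilerplate containers, instead of writing CSS selectors for every site. A minimal stdlib sketch of the idea (the tag lists are illustrative assumptions, not a production extractor — libraries such as trafilatura or readability-lxml do this far more robustly):

```python
# Site-agnostic extraction: keep text from content tags (<p>, headings,
# list items) and ignore anything nested in boilerplate containers
# (<nav>, <footer>, <script>, ...). Stdlib-only sketch.
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside", "form"}
    KEEP = {"p", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many boilerplate containers we are inside
        self.keep_depth = 0   # how many content tags we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.KEEP:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        # Only record text that is inside a content tag and outside
        # every boilerplate container.
        if self.keep_depth and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html_doc: str) -> list[str]:
    parser = MainTextExtractor()
    parser.feed(html_doc)
    return parser.chunks


page = """<html><body>
<nav><p>Home | About</p></nav>
<article><h1>Title</h1><p>Actual content here.</p></article>
<footer><p>Copyright</p></footer>
</body></html>"""
print(extract_text(page))  # → ['Title', 'Actual content here.']
```

The same extractor then runs unchanged across different sites, which also contains the "different HTML format on the same website" problem: layout changes only hurt if they move content out of content-bearing tags.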
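For the residual 20-30% noise, a common alternative to ad-hoc regexes is a heuristic filtering pass in the spirit of C4-style dataset cleaning: normalize each sample, drop anything that fails cheap quality checks, and deduplicate exact repeats. A minimal sketch — the thresholds here are arbitrary examples to tune against your own data:

```python
# Heuristic cleaning for scraped samples: unescape entities, strip
# leftover tags, collapse whitespace, reject short or symbol-heavy
# fragments, and drop exact (case-insensitive) duplicates.
import hashlib
import html
import re


def clean_sample(text: str) -> str:
    text = html.unescape(text)            # &#8217; -> ', &amp; -> &, ...
    text = re.sub(r"<[^>]+>", " ", text)  # strip leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()


def keep_sample(text: str, min_words: int = 5, min_alpha: float = 0.6) -> bool:
    words = text.split()
    if len(words) < min_words:            # too short to be a real sample
        return False
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha >= min_alpha             # reject markup/number-heavy noise


def dedupe(samples):
    seen, out = set(), []
    for s in samples:
        key = hashlib.md5(s.lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out


raw = [
    "It&#8217;s a clean sentence with enough words.",
    "It&#8217;s a clean sentence with enough words.",  # duplicate
    "<div>{}[];;;; 12345 )(*&</div>",                  # markup noise
    "too short",
]
cleaned = [clean_sample(s) for s in raw]
print(dedupe([s for s in cleaned if keep_sample(s)]))
```

Running this keeps only the first sentence: the duplicate, the symbol-heavy fragment, and the too-short line are all filtered out. For large corpora the same idea scales up with near-duplicate detection (e.g. MinHash) rather than exact hashing.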