{"id":22485,"date":"2023-09-20T19:27:14","date_gmt":"2023-09-20T17:27:14","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/i-built-a-free-tool-that-auto-generates-scrapers-for-any-website-with-ai\/"},"modified":"2023-09-20T19:27:14","modified_gmt":"2023-09-20T17:27:14","slug":"i-built-a-free-tool-that-auto-generates-scrapers-for-any-website-with-ai","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/i-built-a-free-tool-that-auto-generates-scrapers-for-any-website-with-ai\/","title":{"rendered":"I Built A Free Tool That Auto-generates Scrapers For Any Website With AI"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I got frustrated with the time and effort required to code and maintain custom web scrapers for collecting data, so my friends and I built an LLM-based solution for data extraction from websites. AI should automate tedious and uncreative work, and web scraping definitely fits this description.<\/p>\n<p>Try it out for free on our playground <a href=\"https:\/\/kadoa.com\/playground\">https:\/\/kadoa.com\/playground<\/a> and let me know what you think!<\/p>\n<p>We&#8217;re leveraging LLMs to understand the website structure and generate the DOM selectors for it. 
Using LLMs for every data extraction, as most comparable tools do, would be far too expensive and slow. Using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient and maintenance-free.<\/p>\n<p>How it works (the playground uses a simplified version of this):<\/p>\n<ol>\n<li>Loading the website: automatically decide what kind of proxy and browser we need<\/li>\n<li>Analyzing network calls: try to find the desired data in the network calls<\/li>\n<li>Preprocessing the DOM: remove all unnecessary elements and compress it into a structure that GPT can understand<\/li>\n<li>Selector generation: use an LLM to find the desired information with the corresponding selectors<\/li>\n<li>Data extraction in the desired format<\/li>\n<li>Validation: hallucination checks and verification that the data is actually on the website and in the right format<\/li>\n<li>Data transformation: clean and map the data (e.g. if we need to aggregate data from multiple sources into the same format). LLMs are great at this task too<\/li>\n<\/ol>\n<p>The vision is fully autonomous and maintenance-free data processing from sources like websites or PDFs, basically &#8220;prompt-to-data&#8221; \ud83d\ude42 It&#8217;s far from perfect yet, but we&#8217;ll get there.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/madredditscientist\"> \/u\/madredditscientist <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/16nq9n6\/i_built_a_free_tool_that_autogenerates_scrapers\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/16nq9n6\/i_built_a_free_tool_that_autogenerates_scrapers\/\">[comments]<\/a><\/span><\/p><div class='watch-action'><div class='watch-position align-right'><div class='action-like'><a class='lbg-style1 like-22485 jlk' href='javascript:void(0)' data-task='like' data-post_id='22485' data-nonce='65e0e39b87' rel='nofollow'><img class='wti-pixel' 
src='https:\/\/www.graviton.at\/letterswaplibrary\/wp-content\/plugins\/wti-like-post\/images\/pixel.gif' title='Like' \/><span class='lc-22485 lc'>0<\/span><\/a><\/div><\/div> <div class='status-22485 status align-right'><\/div><\/div><div class='wti-clear'><\/div>","protected":false},"excerpt":{"rendered":"<p>I got frustrated with the time and effort required to code and maintain custom web scrapers for&#8230;<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-22485","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/22485","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=22485"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/22485\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=22485"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=22485"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=22485"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}