# ScrapeGraphAI 100k: 100,000 Real-World Structured LLM Output Examples From Production Usage

*Posted to [r/datasets](https://www.reddit.com/r/datasets), 2025-12-23*

Announcing **ScrapeGraphAI 100k**, a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:

https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k

**What's Inside:**

This is raw production data, not synthetic and not toy problems. It was derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.

Every example includes:

- `prompt`: the actual user instructions sent to the LLM
- `schema`: a JSON schema defining the expected output structure
- `response`: what the LLM actually returned
- `content`: the source web content (markdown)
- `llm_model`: which model was used (89% gpt-4o-mini)
- `source`: the source URL
- `execution_time`: real timing data
- `response_is_valid`: ground-truth validation (93% valid on average)

**Schema Complexity Metrics:**

- `schema_depth`: nesting levels (typically 2-4, max ~7)
- `schema_keys`: number of fields (typically 5-15, max 40+)
- `schema_elements`: total structural pieces
- `schema_cyclomatic_complexity`: branching complexity from `oneOf`, `anyOf`, etc.
- `schema_complexity_score`: weighted aggregate difficulty metric

All metrics are based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1).

**Data Quality:**

- **Heavily balanced**: cleaned from 9M raw events down to 100k diverse examples
- **Real-world distribution**: includes both simple extractions and gnarly complex schemas
- **Validation annotations**: the `response_is_valid` field tells you when LLMs fail
- **Complexity correlation**: more complex schemas mean lower validation rates (thresholds identified)

**Key Findings:**

- 93% average validation rate across all schemas
- Complex schemas cause noticeable degradation (a non-linear drop-off)
- Response size heavily correlates with execution
time
- 90% of schemas have <20 keys and depth <5
- The top 10% contain the truly difficult extraction tasks

**Use Cases:**

- Fine-tuning models for structured data extraction
- Analyzing LLM failure patterns on complex schemas
- Understanding real-world schema complexity distribution
- Benchmarking extraction accuracy and speed
- Training models that handle edge cases better
- Studying the correlation between schema complexity and output validity

**The Real Story:**

This dataset reflects actual open-source usage patterns; it is not pre-filtered or curated. You see the mess:

- Schema duplication (some schemas are used millions of times)
- Diverse complexity levels (from simple price extraction to full articles)
- Real failure cases (7% of responses don't match their schemas)
- Validation is syntactic only (semantically wrong but schema-valid JSON still passes)

**Load It:**

```python
from datasets import load_dataset

# Repo id matches the dataset link above
dataset = load_dataset("scrapegraphai/scrapegraphai-100k")
```

This is the kind of dataset that's actually useful for ML work: messy, real, and representative of the problems people actually solve.

submitted by [/u/Electrical-Signal858](https://www.reddit.com/user/Electrical-Signal858) | [link](https://www.reddit.com/r/datasets/comments/1ptoo7t/scrapegraphai_100k_100000_realworld_structured/) | [comments](https://www.reddit.com/r/datasets/comments/1ptoo7t/scrapegraphai_100k_100000_realworld_structured/)
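The `schema_depth` and `schema_keys` metrics described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not the SLOT reference implementation; the function names and the example schema are invented for this sketch:

```python
# Rough sketch of two of the dataset's schema complexity metrics,
# loosely following the SLOT paper's notion of structural difficulty.
# Hypothetical helpers, not the reference implementation.

def schema_depth(schema, depth=1):
    """Maximum nesting level of a JSON schema."""
    if not isinstance(schema, dict):
        return depth
    children = []
    for key in ("properties", "items"):
        sub = schema.get(key)
        if isinstance(sub, dict):
            # 'properties' maps field names to subschemas; 'items' is a subschema
            subs = sub.values() if key == "properties" else [sub]
            children.extend(schema_depth(s, depth + 1) for s in subs)
    return max(children, default=depth)

def schema_keys(schema):
    """Total number of named fields across all nesting levels."""
    if not isinstance(schema, dict):
        return 0
    props = schema.get("properties", {})
    count = len(props) + sum(schema_keys(s) for s in props.values())
    items = schema.get("items")
    if isinstance(items, dict):
        count += schema_keys(items)
    return count

# Invented example schema: a product page with a nested price list
example = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "prices": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "currency": {"type": "string"},
                },
            },
        },
    },
}

print(schema_depth(example))  # 4 nesting levels
print(schema_keys(example))   # 4 named fields
```

By the post's numbers, a schema like this (depth 4, 4 keys) sits squarely in the typical range; the hard top-10% tail is where depth and key counts climb past those bounds.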
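The complexity-versus-validity finding can be checked with a simple bucketed analysis once the rows are loaded. Below is a sketch over invented toy records: the field names `schema_complexity_score` and `response_is_valid` come from the dataset card, while the records themselves and the helper name are assumptions for illustration:

```python
from collections import defaultdict

def validity_by_bucket(records, bucket_size=10):
    """Group records into schema_complexity_score buckets and report
    the fraction with response_is_valid == True per bucket."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [valid_count, total_count]
    for r in records:
        b = int(r["schema_complexity_score"] // bucket_size)
        buckets[b][1] += 1
        if r["response_is_valid"]:
            buckets[b][0] += 1
    return {b: valid / total for b, (valid, total) in sorted(buckets.items())}

# Invented toy records standing in for dataset rows
mock = [
    {"schema_complexity_score": 5,  "response_is_valid": True},
    {"schema_complexity_score": 8,  "response_is_valid": True},
    {"schema_complexity_score": 25, "response_is_valid": True},
    {"schema_complexity_score": 27, "response_is_valid": False},
]

print(validity_by_bucket(mock))  # {0: 1.0, 2: 0.5}
```

On the real dataset, running this over the 100k rows should surface the non-linear drop-off the post describes: high validation rates in the low-complexity buckets and a visible decline in the top-10% tail.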