# ScrapeGraphAI 100k: 100,000 Real-World Structured LLM Output Examples From Production Usage

Announcing ScrapeGraphAI 100k – a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:

https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k

What’s Inside:

This is raw production data – not synthetic, not toy problems. It was derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.

Every example includes:

– `prompt`: Actual user instructions sent to the LLM

– `schema`: JSON schema defining expected output structure

– `response`: What the LLM actually returned

– `content`: Source web content (markdown)

– `llm_model`: Which model was used (89% gpt-4o-mini)

– `source`: Source URL

– `execution_time`: Real timing data

– `response_is_valid`: Ground truth validation (avg 93% valid)
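
Here's a minimal loading sketch to inspect one record with the `datasets` library – the dataset id is taken from the Hugging Face link above and the `train` split is an assumption:

```python
from datasets import load_dataset

# Dataset id and split are assumptions taken from the Hugging Face link above;
# adjust if the hosted name or split differs.
ds = load_dataset("scrapegraphai/scrapegraphai-100k", split="train")

row = ds[0]
print(row["prompt"])             # user instruction sent to the LLM
print(row["schema"])             # JSON schema defining the expected output
print(row["llm_model"])          # which model produced this extraction
print(row["response_is_valid"])  # ground-truth validity flag
print(row["execution_time"])     # real timing data
```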

Schema Complexity Metrics:

– `schema_depth`: Nesting levels (typically 2-4, max ~7)

– `schema_keys`: Number of fields (typically 5-15, max 40+)

– `schema_elements`: Total structural pieces

– `schema_cyclomatic_complexity`: Branching complexity from `oneOf`, `anyOf`, etc.

– `schema_complexity_score`: Weighted aggregate difficulty metric

All metrics are based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1).
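
As a rough illustration only (the dataset's own numbers follow the SLOT definitions), depth- and key-count-style metrics can be computed by walking a JSON schema recursively – `schema_depth` and `schema_keys` below are hypothetical re-implementations, not the library's code:

```python
def schema_depth(node, depth=1):
    """Rough sketch: maximum nesting level of a JSON schema node."""
    if not isinstance(node, dict):
        return depth
    children = list(node.get("properties", {}).values())
    if isinstance(node.get("items"), dict):
        children.append(node["items"])
    if not children:
        return depth
    return max(schema_depth(child, depth + 1) for child in children)


def schema_keys(node):
    """Rough sketch: total number of named fields across all nesting levels."""
    if not isinstance(node, dict):
        return 0
    props = node.get("properties", {})
    count = len(props) + sum(schema_keys(child) for child in props.values())
    if isinstance(node.get("items"), dict):
        count += schema_keys(node["items"])
    return count


example = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {
            "type": "object",
            "properties": {"amount": {"type": "number"}, "currency": {"type": "string"}},
        },
    },
}
print(schema_depth(example), schema_keys(example))  # 3 4
```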

Data Quality:

– Heavily filtered and balanced: cleaned from 9M raw events down to 100k diverse examples

– Real-world distribution: includes both simple extractions and gnarly complex schemas

– Validation annotations: the `response_is_valid` field tells you when LLMs fail (see the sketch after this list)

– Complexity correlation: more complex schemas have lower validation rates (thresholds identified)
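
Here's a minimal sketch of how that syntactic check could be reproduced with the `jsonschema` package, assuming each example's `schema` and `response` are stored as JSON strings (a hypothetical helper, not the dataset's own validator):

```python
import json

from jsonschema import ValidationError, validate


def response_is_valid(response_text: str, schema_text: str) -> bool:
    """Syntactic check only: parse both strings and validate the response
    against the schema; semantically wrong but schema-conformant JSON passes."""
    try:
        validate(instance=json.loads(response_text), schema=json.loads(schema_text))
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```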

Key Findings:

– 93% average validation rate across all schemas

– Complex schemas cause noticeable degradation (non-linear drop-off)

– Response size heavily correlates with execution time

– 90% of schemas have <20 keys and depth <5

– The top 10% by schema complexity contain the truly difficult extraction tasks (see the sketch below)
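
For example, here's a hedged sketch of isolating that hardest decile via the `schema_complexity_score` field (same dataset-id and split assumptions as above):

```python
from datasets import load_dataset

# Dataset id and "train" split assumed from the Hugging Face link above.
ds = load_dataset("scrapegraphai/scrapegraphai-100k", split="train")

# 90th-percentile cutoff on the weighted complexity score, then keep the hardest ~10%
scores = sorted(ds["schema_complexity_score"])
cutoff = scores[int(0.9 * len(scores))]
hard = ds.filter(lambda row: row["schema_complexity_score"] >= cutoff)

print(f"{len(hard)} examples at or above a complexity score of {cutoff}")
```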

Use Cases:

– Fine-tuning models for structured data extraction (see the sketch after this list)

– Analyzing LLM failure patterns on complex schemas

– Understanding real-world schema complexity distribution

– Benchmarking extraction accuracy and speed

– Training models that handle edge cases better

– Studying correlation between schema complexity and output validity
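
As one illustration of the fine-tuning use case, here's a hypothetical converter from a dataset row into a chat-style training example – the message layout is an assumption, not part of the dataset:

```python
def to_chat_example(row: dict) -> dict:
    """Hypothetical formatting: pack prompt, schema, and source content into a
    chat-style training example, with the recorded response as the target."""
    user_turn = (
        f"Instruction: {row['prompt']}\n\n"
        f"JSON schema:\n{row['schema']}\n\n"
        f"Content:\n{row['content']}"
    )
    return {
        "messages": [
            {"role": "system", "content": "Extract structured data that matches the given JSON schema."},
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": row["response"]},
        ]
    }
```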

The Real Story:

This dataset reflects actual open-source usage patterns – not pre-filtered or curated. You see the mess:

– Schema duplication (some schemas used millions of times)

– Diverse complexity levels (from simple price extraction to full articles)

– Real failure cases (7% of responses don’t match their schemas)

– Validation is syntactic only (semantically wrong but valid JSON passes)

Load It:

```python
from datasets import load_dataset

dataset = load_dataset("scrapegraphai/scrapegraphai-100k")
```

This is the kind of dataset that’s actually useful for ML work – messy, real, and representative of actual problems people solve.

submitted by /u/Electrical-Signal858