# r/datasets – ScrapeGraphAI 100k
Announcing ScrapeGraphAI 100k – a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:
https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k
What’s Inside:
This is raw production data – not synthetic, not toy problems. Derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.
Every example includes:
– `prompt`: Actual user instructions sent to the LLM
– `schema`: JSON schema defining expected output structure
– `response`: What the LLM actually returned
– `content`: Source web content (markdown)
– `llm_model`: Which model was used (89% gpt-4o-mini)
– `source`: Source URL
– `execution_time`: Real timing data
– `response_is_valid`: Ground truth validation (avg 93% valid)
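To illustrate what the `response_is_valid` annotation captures, here is a minimal sketch of a purely syntactic check (a hypothetical helper, not ScrapeGraphAI's actual validator): it only verifies required keys and primitive type tags, so semantically wrong but well-formed JSON would still pass.

```python
# Minimal sketch of a syntactic validity check like `response_is_valid`.
# Hypothetical helper, not the library's implementation: it checks only
# required keys and primitive JSON types, nothing semantic.

TYPE_MAP = {"string": str, "number": (int, float), "integer": int,
            "boolean": bool, "array": list, "object": dict}

def is_syntactically_valid(response: dict, schema: dict) -> bool:
    # Every required key must be present.
    for key in schema.get("required", []):
        if key not in response:
            return False
    # Present keys must match their declared primitive type.
    for key, sub in schema.get("properties", {}).items():
        if key in response and "type" in sub:
            if not isinstance(response[key], TYPE_MAP[sub["type"]]):
                return False
    return True

schema = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {"title": {"type": "string"}, "price": {"type": "number"}},
}
print(is_syntactically_valid({"title": "Widget", "price": 9.99}, schema))  # True
print(is_syntactically_valid({"title": "Widget"}, schema))                 # False
```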
Schema Complexity Metrics:
– `schema_depth`: Nesting levels (typically 2-4, max ~7)
– `schema_keys`: Number of fields (typically 5-15, max 40+)
– `schema_elements`: Total structural pieces
– `schema_cyclomatic_complexity`: Branching complexity from `oneOf`, `anyOf`, etc.
– `schema_complexity_score`: Weighted aggregate difficulty metric
All metrics based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1)
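The depth and key-count metrics are straightforward to reproduce from a schema dict; here is a rough sketch (my own recursion, which may differ in detail from the SLOT paper's exact definitions):

```python
# Rough sketch of two complexity metrics; the dataset follows the SLOT
# paper's definitions, which may differ in detail from this recursion.

def schema_depth(schema: dict) -> int:
    """Maximum nesting level across `properties` and array `items`."""
    children = list(schema.get("properties", {}).values())
    if "items" in schema:
        children.append(schema["items"])
    if not children:
        return 1
    return 1 + max(schema_depth(c) for c in children)

def schema_keys(schema: dict) -> int:
    """Total number of named fields, counting nested objects."""
    total = 0
    for sub in schema.get("properties", {}).values():
        total += 1 + schema_keys(sub)
    if "items" in schema:
        total += schema_keys(schema["items"])
    return total

article = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array",
                    "items": {"type": "object",
                              "properties": {"name": {"type": "string"}}}},
    },
}
print(schema_depth(article), schema_keys(article))  # 4 3
```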
Data Quality:
– Deduplicated and balanced: reduced from 9M raw events to 100k diverse examples
– Real-world distribution: Includes simple extractions and gnarly complex schemas
– Validation annotations: `response_is_valid` field tells you when LLMs fail
– Complexity correlation: validation rates fall as schema complexity rises, with identifiable drop-off thresholds
Key Findings:
– 93% average validation rate across all schemas
– Complex schemas cause noticeable degradation (non-linear drop-off)
– Response size heavily correlates with execution time
– 90% of schemas have <20 keys and depth <5
– Top 10% contain the truly difficult extraction tasks
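The complexity-vs-validity finding above boils down to a group-by over the annotation fields; here is a stdlib-only sketch on toy records (field names match the dataset, the sample values are made up):

```python
# Sketch of the complexity-vs-validity analysis using only the stdlib.
# Field names match the dataset columns; the sample rows are made up.
from collections import defaultdict

records = [
    {"schema_depth": 2, "response_is_valid": True},
    {"schema_depth": 2, "response_is_valid": True},
    {"schema_depth": 4, "response_is_valid": True},
    {"schema_depth": 4, "response_is_valid": False},
    {"schema_depth": 6, "response_is_valid": False},
]

def validity_by_depth(rows):
    """Fraction of valid responses per schema_depth bucket."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["schema_depth"]].append(r["response_is_valid"])
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

print(validity_by_depth(records))  # {2: 1.0, 4: 0.5, 6: 0.0}
```

Running the same aggregation over the full 100k rows (e.g. after loading with `datasets`) is what surfaces the non-linear drop-off.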
Use Cases:
– Fine-tuning models for structured data extraction
– Analyzing LLM failure patterns on complex schemas
– Understanding real-world schema complexity distribution
– Benchmarking extraction accuracy and speed
– Training models that handle edge cases better
– Studying correlation between schema complexity and output validity
The Real Story:
This dataset reflects actual open-source usage patterns – not pre-filtered or curated. You see the mess:
– Schema duplication (some schemas used millions of times)
– Diverse complexity levels (from simple price extraction to full articles)
– Real failure cases (7% of responses don’t match their schemas)
– Validation is syntactic only (semantically wrong but valid JSON passes)
Load It:
```python
from datasets import load_dataset

dataset = load_dataset("scrapegraphai/sgai-100k")
```
This is the kind of dataset that’s actually useful for ML work – messy, real, and representative of actual problems people solve.
submitted by /u/Electrical-Signal858