We Benchmarked 18 LLMs on OCR (7k+ Calls) — Cheaper/Older Models Often Win. Full Dataset + Framework Open-Sourced.

TL;DR: We were overpaying for OCR, so we compared flagship models against cheaper and older ones on a new, curated dataset of standard documents you’d find in real-world industry.

We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:

Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran every model on each one 10 times under identical conditions: 18 models × 42 documents × 10 runs = 7,560 total calls. Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.

We track pass^n (the probability of getting a document right n times in a row, i.e. reliability at scale), cost-per-success, latency, and critical-field accuracy.
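
For concreteness, here’s a minimal sketch of how the first two metrics can be computed. This is not the harness’s actual code; the record layout, field names, and the per-document estimate of pass^n are assumptions:

```python
# Minimal metric sketch (hypothetical record layout, not the repo's code).
# Each run is assumed to be a dict with keys: "doc_id", "success", "cost_usd".
from collections import defaultdict

def pass_hat_n(runs, n=10):
    """pass^n: probability a model gets a document right n times in a row.
    Estimated per document as (observed pass rate)^n, then averaged."""
    by_doc = defaultdict(list)
    for r in runs:
        by_doc[r["doc_id"]].append(r["success"])
    per_doc = [(sum(s) / len(s)) ** n for s in by_doc.values()]
    return sum(per_doc) / len(per_doc)

def cost_per_success(runs):
    """Total spend divided by the number of successful extractions."""
    successes = sum(r["success"] for r in runs)
    total = sum(r["cost_usd"] for r in runs)
    return float("inf") if successes == 0 else total / successes
```

The pass^n framing matters because a 95% single-call pass rate drops to roughly 0.95^10 ≈ 60% odds of ten clean runs in a row, which is what batch pipelines actually need.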

Because all of the data is synthetic, nothing had to be redacted. The documents are still representative of real-world ones: the layout and information density are similar to real documents; only the actual field values are synthetic. The set covers four document types:

  • Invoices
  • Transport orders
  • Bills of Lading
  • Receipts (from the CORU dataset)

Dataset (Hugging Face): https://huggingface.co/datasets/Timokerr/OCR_baseline

Benchmark Harness Repo: https://github.com/ArbitrHq/ocr-mini-bench
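
If you just want to poke at the documents, something like the following should work (assuming the dataset loads with its default config via the `datasets` library; exact split and column names may differ):

```python
# Quick look at the dataset; config/split names are assumptions.
from datasets import load_dataset

ds = load_dataset("Timokerr/OCR_baseline")
print(ds)  # inspect available splits and columns
```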

Curious whether this matches what others here are seeing.
