**Struggling to extract data from 1,500+ mixed scanned/digital PDFs. Tesseract, OCR, and vision LLMs all failing. Need advice.**

Posted 2025-12-26 to r/datasets.

Hi everyone,

I am working on my thesis and I have a dataset of about 1,500 PDF reports from the DGHS (Directorate General of Health Services). I need to extract specific table rows (district-wise dengue statistics) from them.

**The problem:** The PDFs are a nightmare mix. Some are digital with selectable text, but many are low-quality scans or photos of paper reports. The extracted text is often garbled (mojibake) because of the fonts used, and the layout shifts slightly between years.

**What I have tried so far (and why it failed):**

1. **Tesseract OCR:** struggled badly with the mixed Bengali/English text and the table borders. The output was mostly noise.
2. **Standard PDF scraping (pdfplumber/PyPDF):** works on the digital files, but returns garbage characters (e.g., `Kg‡dvU` instead of "Chittagong") due to bad font encoding in the source files.
3. **Ollama (Llama 3.1 & MiniCPM-V):**
   - *Llama 3.1 (text):* hallucinates numbers or falls over when it sees the garbled text.
   - *MiniCPM-V (vision):* my best bet so far. I wrote a script that converts pages to images and feeds them to the model. It works for about ten files, but then it starts hallucinating or missing rows entirely, and it is very slow.

**The goal:** I just need to reliably extract the **District Name**, **New Cases**, **Total Cases**, and **Deaths** for a specific division (Chittagong) into a CSV.

I have attached a screenshot of one of the "bad" scanned pages.

Has anyone successfully processed a mixed-quality dataset like this? Should I be fine-tuning a small model, or is there a specific OCR pipeline (PaddleOCR, Document AI, etc.) that handles this better than raw LLMs?

Any pointers would be a lifesaver.
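One idea I have started experimenting with, in case it helps frame answers: triage each page before picking a tool. If pdfplumber/PyPDF returns no text, the page is an image-only scan and must go to OCR; if it returns text riddled with Windows-1252 punctuation debris (legacy Bijoy-style Bengali fonts map glyphs onto that range, which is exactly what `Kg‡dvU` looks like), the embedded text layer is unusable and the page should also go down the image path. A minimal sketch of the routing heuristic; the telltale character set is my assumption and would need tuning against the real files:

```python
# Route each page: use the embedded text layer when it looks clean,
# fall back to rasterise-and-OCR when it is missing or mojibake.
# ASSUMPTION: legacy Bijoy-style fonts leak characters like '‡' into
# extracted text; this set is illustrative, not exhaustive.
BIJOY_TELLTALES = set("‡†ˆ‰Š‹ŒŽšœžŸ")

def looks_garbled(text: str) -> bool:
    if any("\u0980" <= ch <= "\u09ff" for ch in text):
        return False  # genuine Unicode Bengali, not mojibake
    return any(ch in BIJOY_TELLTALES for ch in text)

def route_page(page_text: str) -> str:
    """Return 'text' if the embedded text layer seems usable,
    'ocr' if the page should be converted to an image instead."""
    if not page_text.strip():
        return "ocr"  # image-only scan, no text layer at all
    if looks_garbled(page_text):
        return "ocr"  # text layer exists but is encoding garbage
    return "text"
```

Here `page_text` would come from something like pdfplumber's `page.extract_text()`; the heuristic itself only looks at the string, so it works with any extractor.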
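On the hallucination problem specifically: instead of trusting the vision model's free-form answer, I am considering prompting it for one `district, new, total, deaths` line per table row and then validating every line in code, rejecting anything that does not parse or fails basic sanity checks, so bad pages get re-prompted or flagged for manual review rather than silently polluting the CSV. A sketch of the validator; the exact regex and the checks are assumptions about what the model emits:

```python
import re
from typing import Optional, Tuple

# Expected shape per line: "<district> <new> <total> <deaths>", with
# optional commas or pipes between fields. Illustrative, not final.
ROW = re.compile(
    r"^\s*([A-Za-z'.\- ]+?)\s*[,|]?\s+(\d+)\s*[,|]?\s+(\d+)\s*[,|]?\s+(\d+)\s*$"
)

def parse_row(line: str) -> Optional[Tuple[str, int, int, int]]:
    """Parse one model-output line; return None for anything suspicious
    so it can be re-prompted or queued for manual review."""
    m = ROW.match(line)
    if not m:
        return None
    district = m.group(1)
    new, total, deaths = (int(g) for g in m.group(2, 3, 4))
    if new > total or deaths > total:
        return None  # a cumulative total can't be below new cases/deaths
    return district, new, total, deaths
```

A whitelist of the 11 Chittagong-division district names (spelled the way the reports actually spell them) would make this even stricter, since a hallucinated district name is then rejected outright.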
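One more cross-check I am thinking about for after extraction, assuming the "Total Cases" column in these DGHS reports really is cumulative (which I still need to verify): for each district, the total should never decrease from one report date to the next, so any dip is almost certainly an OCR or LLM misread and can be flagged for a manual look instead of accepted:

```python
from typing import List, Tuple

def monotonic_violations(series: List[Tuple[str, int]]) -> List[str]:
    """series: (report_date, total_cases) pairs for ONE district,
    sorted by date. Returns the dates where the cumulative total
    dropped, i.e. likely extraction errors to re-check by hand."""
    return [
        date
        for (_, prev_total), (date, total) in zip(series, series[1:])
        if total < prev_total
    ]
```

Run over the whole corpus, this turns 1,500 files into a short list of suspect cells rather than a full manual audit.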
I'm drowning in manual data entry right now.

submitted by [/u/deletedusssr](https://www.reddit.com/user/deletedusssr) · [link](https://www.reddit.com/r/datasets/comments/1pw8om4/struggling_to_extract_data_from_1500_mixed/) · [comments](https://www.reddit.com/r/datasets/comments/1pw8om4/struggling_to_extract_data_from_1500_mixed/)