Struggling To Extract Data From 1,500+ Mixed Scanned/digital PDFs. Tesseract, OCR, And Vision LLMs All Failing. Need Advice.

Hi everyone,

I am working on my thesis and I have a dataset of about 1,500 PDF reports from the DGHS (Health Services). I need to extract specific table rows (District-wise Dengue stats) from them.

The Problem: The PDFs are a nightmare mix. Some are digital with selectable text, but many are low-quality scans or photos of paper reports. The fonts are often garbled (mojibake) when extracted as text, and the layout changes slightly between years.

What I have tried so far (and why it failed):

Tesseract OCR: It struggled hard with the Bengali/English mix and the table borders. The output was mostly noise.
Standard PDF scraping (pdfplumber/PyPDF): Works on the digital files, but returns garbage characters (e.g., Kg‡dvU instead of “Chittagong”) due to bad font encoding in the source files.
Ollama (Llama 3.1 & MiniCPM-V):
- Llama 3.1 (Text): Hallucinates numbers or crashes when it sees the garbled text.
- MiniCPM-V (Vision): This was my best bet. I wrote a script to convert pages to images and feed them to the model. It works for about 10 files, but then it starts hallucinating or missing rows entirely, and it’s very slow.
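
One way to handle the digital/scanned mix is to triage each page before choosing a tool. The following is a minimal sketch, assuming the page text has already been extracted (for example with pdfplumber) and that usable text is either ASCII or in the Unicode Bengali block. The names `suspicious_ratio` and `triage_page` and the 2% threshold are hypothetical, not taken from any library.

```python
BENGALI_BLOCK = range(0x0980, 0x0A00)  # Unicode block for Bengali script


def suspicious_ratio(text: str) -> float:
    """Share of non-whitespace characters that are neither ASCII nor Bengali."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odd = sum(1 for c in chars if ord(c) > 127 and ord(c) not in BENGALI_BLOCK)
    return odd / len(chars)


def triage_page(extracted_text: str, max_ratio: float = 0.02) -> str:
    """Return "text" if the text layer looks usable, otherwise "ocr"."""
    if not extracted_text.strip():
        return "ocr"  # no text layer at all: likely a scan or photo
    if suspicious_ratio(extracted_text) > max_ratio:
        return "ocr"  # text layer exists but looks like a broken font encoding
    return "text"


print(triage_page("Kg‡dvU 12 340 2"))
print(triage_page("Chittagong 12 340 2"))
```

This only catches garbling that leaves stray symbols such as ‡ behind. A legacy font that maps entirely onto plain ASCII letters would pass the check, so a dictionary or district-name lookup may still be needed on top.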

The Goal: I just need to reliably extract the District Name, New Cases, Total Cases, and Deaths for a specific division (Chittagong) into a CSV.
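
Whichever extractor is used, hallucinated or missing rows are easier to catch when every page is validated before it reaches the CSV. A rough sketch under stated assumptions: each extracted row is a list of four strings (district, new cases, total cases, deaths), the expected district spellings are supplied by the user, and total cases are cumulative, so new cases and deaths should not exceed them. `validate_rows` is a hypothetical helper name.

```python
def validate_rows(rows, expected_districts):
    """Return (clean_rows, problems) for one page of extracted table rows."""
    clean, problems, seen = [], [], set()
    for row in rows:
        if len(row) != 4:
            problems.append(f"wrong column count: {row!r}")
            continue
        district, *numbers = (cell.strip() for cell in row)
        if district not in expected_districts:
            problems.append(f"unknown district: {district!r}")
            continue
        seen.add(district)
        # isdecimal() accepts decimal digits only, which int() can parse.
        if not all(n.isdecimal() for n in numbers):
            problems.append(f"non-numeric value for {district}: {numbers!r}")
            continue
        new, total, deaths = map(int, numbers)
        # Assumption: totals are cumulative, so they bound the other counts.
        if new > total or deaths > total:
            problems.append(f"inconsistent counts for {district}")
            continue
        clean.append((district, new, total, deaths))
    for missing in sorted(expected_districts - seen):
        problems.append(f"missing district: {missing}")
    return clean, problems


rows = [["Chittagong", "12", "340", "2"], ["Feni", "5", "3", "0"]]
clean, problems = validate_rows(rows, {"Chittagong", "Feni", "Noakhali"})
print(clean)
print(problems)
```

Pages with a non-empty `problems` list can be queued for a second pass or manual review, which keeps the manual work limited to the pages that actually failed.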

I have attached a screenshot of one of the “bad” scanned pages.

Has anyone successfully processed a mixed-quality dataset like this? Should I be fine-tuning a small model, or is there a specific OCR pipeline (like PaddleOCR or DocumentAI) that handles this better than raw LLMs?

Any pointers would be a lifesaver. I’m drowning in manual data entry right now.

submitted by /u/deletedusssr

How Do You Efficiently Pre-filter And Group WhatsApp Numbers To Boost Engagement?

Hey everyone,

Lately I’ve been playing around with a little workflow for screening WhatsApp numbers. The idea’s pretty simple: figure out which numbers are actually active and get a sense of engagement, without bothering anyone. It’s super handy if you need to quickly group contacts or analyze interaction rates.

I realized that just four fields can filter out around 60% of low-value numbers: number | last_seen | replied | bounce.

I wrote a few simple scripts for pre-filtering, but some steps felt kinda repetitive, so I started using a small tool (TNTwuYou) to handle list validation and reply tracking.

Some things I’ve tried:

Sorting numbers by last active date, so you hit the active folks first.
Grouping contacts based on reply status.
Using simple scripts with data to get a clear picture of which regions or types of people are more likely to engage.
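
The pre-filtering steps above can be sketched roughly as follows, using the four fields from the post (number, last_seen, replied, bounce). The `Contact` class, the function names, and the 30-day idle cutoff are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contact:
    number: str
    last_seen: date
    replied: bool
    bounce: bool


def prefilter(contacts, today, max_idle_days=30):
    """Drop bounced or long-idle numbers, then sort most recently active first."""
    keep = [
        c for c in contacts
        if not c.bounce and (today - c.last_seen).days <= max_idle_days
    ]
    return sorted(keep, key=lambda c: c.last_seen, reverse=True)


def group_by_reply(contacts):
    """Split contacts into two lists of numbers based on reply status."""
    groups = {"replied": [], "no_reply": []}
    for c in contacts:
        groups["replied" if c.replied else "no_reply"].append(c.number)
    return groups


contacts = [
    Contact("+10000000001", date(2025, 1, 30), True, False),
    Contact("+10000000002", date(2024, 10, 1), False, False),
    Contact("+10000000003", date(2025, 1, 29), False, True),
]
kept = prefilter(contacts, today=date(2025, 1, 31))
print(group_by_reply(kept))
```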

Has anyone done reply probability scoring?

Do you base it on a time window or historical reply rate?
Anyone tried using graph or clustering methods for grouping contacts?

submitted by /u/Suspicious_Prior4515

I Made A Website That Showcases The 311 Requests Dataset

311wrapped.com

submitted by /u/eltokh7

Has Anyone Tried Letting AI Agents Access Your Data And Pay Per Request?

I’m curious whether anyone here has actually tried letting AI agents directly access a dataset or data API and pay based on usage (e.g. per request or per query).

I’ve seen ideas around usage-based APIs and agent tool-calling, but I’m not sure how this works in practice when the client isn’t a human. Did it make sense economically? Were abuse, pricing, or access control big issues?

Would love to hear whether people have tried something similar, or decided it wasn’t worth a try.

submitted by /u/Shot_Fudge_6195

Dataset Of 5k High-quality Trivia Questions Pulled From Open Trivia

https://github.com/leakyhose/open-trivia-dataset

Pulled from the Open Trivia Database, which locks the questions behind an API call that returns only 50 at a time. I ran a script that calls it repeatedly, stores the questions, and sorts them by difficulty and category.
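
A loop like the one described could look roughly like this. It is a sketch, not the script from the repo: `fetch_batch` is a hypothetical callable that wraps the HTTP request and returns the decoded JSON, and the field names (`response_code`, `results`, `question`, `category`, `difficulty`) are assumptions about the response shape.

```python
DIFFICULTY_ORDER = {"easy": 0, "medium": 1, "hard": 2}


def collect_questions(fetch_batch, batch_size=50, max_calls=1000):
    """Call fetch_batch until it stops returning data, then dedupe and sort."""
    seen = {}
    for _ in range(max_calls):
        payload = fetch_batch(batch_size)
        if payload.get("response_code") != 0:
            break  # assumed to mean "no more questions" or an error
        for q in payload["results"]:
            seen.setdefault(q["question"], q)  # keep first copy of each question
    return sorted(
        seen.values(),
        key=lambda q: (q["category"], DIFFICULTY_ORDER.get(q["difficulty"], 99)),
    )


# Stand-in for the real API so the sketch runs offline.
fake_batches = iter([
    {"response_code": 0, "results": [
        {"question": "Q1", "category": "Science", "difficulty": "hard"},
        {"question": "Q2", "category": "Art", "difficulty": "easy"},
    ]},
    {"response_code": 4, "results": []},
])
print(collect_questions(lambda n: next(fake_batches)))
```

Passing the fetcher in as a parameter keeps the paging and dedup logic testable without network access. A real fetcher would also need to respect whatever rate limit the API enforces.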

submitted by /u/ishotapig

Open Source Or Cheap Alternative To GICS/ICB Security Industry Sectors

GICS (The Global Industry Classification Standard from MSCI) and ICB (Industry Classification Benchmark from FTSE/LSE/Dow Jones) seem to dominate the securities industry sector data market.

There are alternatives available from players such as ICE, but in all cases they are proprietary and, as far as I can tell, pretty much identical.

They each have 11 top-level sectors, which are then split into more and more granular sub-categories.

I’m fairly certain that nobody really has any use for the most granular sub-sectors which contain >160 sectors… But the high and mid level classifications would be really useful.

You can theoretically grab sector weightings data from Yahoo Finance by ticker code… But I’d ideally like to be able to use either SEDOL or ISIN to look values up.

I’m sure there are others who would like something like this, so before I think about trying to create my own gizmo for it, I was wondering if anybody has done anything similar?

submitted by /u/VivicaFromGsyEh

Where Do I Get A Huge Amount Of Data For Nmap?

submitted by /u/SubstanceWrong6878

[PAID] I Compiled A Clean JSON Dataset Of All Japanese Prefectures And 1,700+ Cities For Developers [self-promotion]

I’m working on a project that required accurate hierarchical Japanese location data
(prefecture → city/ward/town/village).

Since most publicly available datasets were outdated, inconsistent, or missing entries,
I compiled a clean version from multiple official sources.

It includes:

1 country
47 prefectures
1,700+ municipalities
consistent hierarchical IDs
UTF-8, machine-friendly
suitable for forms, address validation, GIS, ML, and location-based apps

If anyone is interested, I’m happy to provide details or export it as CSV / SQL.

The full JSON dataset is available here (paid):
https://makotocroco.gumroad.com/l/japan-locations

(self-promotion: this is my own dataset)

submitted by /u/Specialist-Weight407

Dataset Release: Real Structural Engineering Drawings For AI (PNED – 6 RC Datasets)

Hi everyone,

I’ve been working as a structural engineer for about 10 years (Germany, RC design).
Over the last few years I’ve noticed something very surprising in AI/ML:

We have datasets for almost everything — but none for real structural engineering drawings.

These drawings are extremely challenging for machine learning due to:

dense, overlapping geometry
structural symbols and reinforcement notation
dimensions, leaders, section markers
multi-layer technical detailing
scale-dependent information
mixed text + geometry + symbols

Because of this, they are highly relevant for:

OCR / document understanding
object detection
layout analysis
symbol recognition
segmentation
BIM automation
engineering-focused CV research

So I started building a series of datasets of real reinforced-concrete drawings, created specifically for ML tasks.

Each dataset contains:

25 PDF engineering drawings (Columns 50 PDF)
25 PNG images (1200 dpi) (Columns 50 PDF)
one structural category per dataset (RC beams, walls, foundations, columns, precast columns, etc.)

So far I’ve released 6 datasets:

RC Beams V1
RC Columns V1
RC Foundations V1
RC Precast Columns V1
RC Walls V1
RC Walls V2

All datasets, including sample images, can be viewed here:

👉 https://huggingface.co/PNEngineeringDatasets

I’d be happy to hear any feedback, suggestions or use cases you think could be valuable for ML research in this domain.

Disclaimer: this is my own dataset project; posting once for visibility.

submitted by /u/PNEngineeringDataset

Where To Find Monolingual Dictionary Dataset For Multiple Languages

Hello guys. Any idea where I could get a free dataset containing monolingual dictionaries (word-definition pairs in the same language) in multiple languages? I got English from kaikki (Wiktionary), but it is missing other languages’ ‘senses’. WordNet might be no good, since I need sensible definitions. I’m considering making it myself from the Wiktionary dumps of different languages, but I thought it might be better to ask first.

submitted by /u/JamesAntoni

Searching Publicly Available Multimodal Health Related Dataset

Would you please help me find publicly available multimodal (image, audio, or sensor) healthcare-related datasets for novel research?

submitted by /u/Objective-Meat2499

Interlock — A Circuit-breaker & Certification System For RAG + Vector DBs, With Stress-chamber Validation And Signed Forensic Evidence (code + Results) (advanced Free Data Tool) Feedback Pls

Interlock is a safety layer for production AI stacks that does three things: detects degradation/hazard, refuses or degrades responses when confidence is low, and records cryptographically verifiable evidence of the intervention. The repo includes middleware (Express, FastAPI), adapters for 6 vector DBs, CI-driven stress chamber tests, benchmarks, and certified badges with signatures. Repo & quickstart: https://github.com/CULPRITCHAOS/Interlock

What’s novel / useful from an ML perspective

Formal primitives (Hazard, Reflex, Guard, State, Confidence, Trust Decay) to reason about operating envelopes for LLM/RAG systems.

Stress-chamber + production-simulation CI workflows that inject latency/errors to evaluate recovery & cascade risk.

Evidence-over-claims approach: signed artifacts that let you prove interventions happened and why — useful for audits, incident triage, and model governance.

Restart continuity: protection survives process restarts (addresses anti-amnesia).

Key experimental results (from v5.3 README)

False negative rate: 0% in validated scenarios

False positive rate: 4.0% (tradeoff to reduce silent corruption)

Recovery time mean: 52.3s, P95 ≈ 58.3s

Zero cascading failures & zero data loss in tests

What you can find in the repo

Middleware for Express and FastAPI to add Interlock to existing stacks

Stress chamber scripts that run protected vs control comparative experiments

Benchmark suite and artifact retention of evidence and certification badges

Live-monitor reference service and scripts to reproduce injected failures

Documentation: primitives, validation artifacts, case study, and live incidents

Why this matters for ML ops & research

Bridges the gap between research on LLM calibration / confidence and production safety tooling.

Provides a repeatable evaluation pipeline for failure‑survivability and impact analysis (including economic impact reports).

Enables measurable trade-offs (false positives vs safety) with reproducible artifacts to tune policies.

Suggested experiments or avenues for feedback

Calibration strategies that reduce FPR while keeping FN≈0

Alternative reflex actions (partial answer + flagged sections vs full refusal)

Integration with downstream retraining / feedback loops using forensic logs

Domain-specific thresholds (healthcare / finance) and legal/compliance validation

This is MY FIRST INFRA PROJECT and I’m a new coder. Any suggestions or feedback I’d GREATLY APPRECIATE!

submitted by /u/CulpritChaos

How To Determine A Value For A Question In A Survey

Hello,

I want to get some opinions and recommendations on statistical methods that could be used for my analysis.

The plan is to collect data through a survey and a database search. The results will be used as input and output for Data Envelopment Analysis (DEA). The target of the survey is a decision-making unit (DMU).

There are eight input items and two output items. The score for the input items will be based on the survey responses received. For output items, the score will be calculated using data from public databases.

Each item comprises questions with different types of answers. These include yes/no questions, questions where you select one of statements 1–5, and numerical questions. The number of questions for each item varies depending on its specific characteristics.

This is how I grade each answer and calculate the total score for each item.

Scoring answers:

Type A question: yes/no, YES is given score 3, NO is given score 1

Type B question: A score from 1 to 5 is given based on the score of the selected answer

Type C question: numerical question. The number (n) will be given a score based on the calculation of the mean/median of all the collected answers. If n < Q2, the score is 1; if n = Q2, the score is 2; and if n > Q2, the score is 3.

I then sum up the grades from all the questions in each item. The final score for an item = total grade / max grade * 5 (I set the highest possible score for an item to 5).
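
The grading rules above can be written out as a short sketch. The function names are hypothetical, and the median is used for Q2 as in the Type C rule; the example answers are made up.

```python
from statistics import median


def score_yes_no(answer: bool) -> int:
    """Type A: YES scores 3, NO scores 1."""
    return 3 if answer else 1


def score_choice(selected: int) -> int:
    """Type B: the selected statement (1 to 5) is the score."""
    if not 1 <= selected <= 5:
        raise ValueError("selected statement must be between 1 and 5")
    return selected


def score_numeric(n: float, all_answers: list) -> int:
    """Type C: compare n with the median (Q2) of all collected answers."""
    q2 = median(all_answers)
    if n < q2:
        return 1
    return 2 if n == q2 else 3


def item_score(grades: list, max_grades: list) -> float:
    """Final item score = total grade / max grade * 5."""
    return sum(grades) / sum(max_grades) * 5


# One item with one question of each type (made-up answers).
grades = [score_yes_no(True), score_choice(4), score_numeric(10, [4, 10, 12])]
print(item_score(grades, max_grades=[3, 5, 3]))
```

Written this way, one property of the scheme becomes visible: a Type B question can contribute up to 5 points while Types A and C contribute at most 3, so items with more Type B questions are weighted toward them unless the grades are rescaled first.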

A radar chart for a DMU will be developed showing the scores of the 8 input items.

For the output items:

The data is derived from a public database. I classify the data from each DMU into one of four groups based on quality.

Group	HHQ	HQ	LQ	LLQ
DMU1	XX	XX	XX	XX
DMU2	XX	XX	XX	XX
DMU3	XX	XX	XX	XX

Mean/median	XX	XX	XX	XX

For the scoring:

1. Derive the frequency number from the database.
2. Calculate the median for each group.
3. Set the grade as 1 to 3 (same as the type C question).

Group	HHQ	HQ	LQ	LLQ
DMU1	1	3	3	2
DMU2	3	2	2	3
DMU3	3	1	2	2

4. Because I want data from the higher-quality groups to contribute more to the total score, a multiplication factor that depends on the group is applied to each grade, as follows:

Output1

Group	HHQ	HQ	LQ	LLQ	Output1 value
DMU1	1*5	3*3	3*2	2	=Sum/Max sum*5
DMU2	3*5	2*3	2*2	3	=Sum/Max sum*5
DMU3	3*5	1*3	2*2	2	=Sum/Max sum*5

This is how I set the input and output values for each DMU.
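
As a worked check of the weighted output score, here is a small sketch using the grades from the table. It assumes the LLQ group has an implicit weight of 1 and that "Max sum" means the highest grade (3) in every group, i.e. 3 * (5 + 3 + 2 + 1) = 33. Both are my reading of the table, not something stated in the post.

```python
WEIGHTS = {"HHQ": 5, "HQ": 3, "LQ": 2, "LLQ": 1}  # LLQ weight of 1 is assumed
MAX_GRADE = 3

GRADES = {
    "DMU1": {"HHQ": 1, "HQ": 3, "LQ": 3, "LLQ": 2},
    "DMU2": {"HHQ": 3, "HQ": 2, "LQ": 2, "LLQ": 3},
    "DMU3": {"HHQ": 3, "HQ": 1, "LQ": 2, "LLQ": 2},
}


def output_value(grades: dict) -> float:
    """Weighted sum of grades, rescaled so the best possible result is 5."""
    weighted = sum(grades[g] * w for g, w in WEIGHTS.items())
    max_sum = MAX_GRADE * sum(WEIGHTS.values())
    return weighted / max_sum * 5


for dmu, grades in GRADES.items():
    print(dmu, round(output_value(grades), 2))
# DMU1 3.33
# DMU2 4.24
# DMU3 3.64
```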

Question:

Is this kind of scoring acceptable, even when there are different types of questions for each input item?
Is there a scientific method that can be applied here? For example, how should the score for each answer be set? I have found papers that use scoring in their surveys, but their questions are usually of the same type, producing the same type of answer (e.g. a Likert scale).

Any comments or advice would be appreciated. If anyone can recommend references, that would be awesome.

Thank you.
marlee

submitted by /u/Fast-Rise17

Portuguese Dataset For Training A Chat Model

I need a chat dataset to train a model like those virtual friend or virtual girlfriend apps. I want it to be able to hold a turn-by-turn conversation.

submitted by /u/oversolan007

Historical Canadian Infectious Disease Data

submitted by /u/F0urLeafCl0ver

Unpopular Opinion: If It’s On The Public Web, It’s Scrapeable. Change My Mind.

submitted by /u/Warm_Talk3385

Tomato Leaf Dataset Containing Environmental Conditions Such As Different Humidity And Lighting Factors

Hello, I’m looking for a tomato leaf dataset covering environmental conditions such as high/low humidity and lighting for my thesis. Most of the datasets on the web focus on diseases. Can anyone help? Thanks!

submitted by /u/DivergentG

Looking For Wheat Yellow Rust Image Datasets For ML Project (with Metadata)

We’re undergraduate Machine Learning students working on a crop disease generation project using CGANs, aimed at supporting global sustainability. 🌱

Right now, we’re looking for wheat images of yellow rust disease along with metadata like region, severity, and time range for model training and evaluation.

If you know of any public datasets, research projects, or institutional resources, or even just pointers on where to look, we’d really appreciate your guidance.

Thanks so much for your help! Any leads will be credited in our project.

submitted by /u/Plane_Race_840

Does A Corpus Of Archaic English Words Exist?

I have a large database/wordlist containing probably every English dictionary word plus many additional ones like brand names, but this naturally includes many words no longer in use. I need to cut down the size of the list, but since too many words have been added to it to start from scratch, my plan is to obtain a corpus of only archaic words and use these as negatives to remove from the main wordlist. Does such a corpus/wordlist exist anywhere in text form, even if it’s just a word per line? Thank you in advance, any help is much appreciated.
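
Once such a list is found, the removal step itself is small. A minimal sketch, assuming both lists are available as iterables of one word per entry and that matching should ignore case; `remove_archaic` is a hypothetical name.

```python
def remove_archaic(words, archaic):
    """Return the words that do not appear in the archaic list (case-insensitive)."""
    blocked = {w.strip().casefold() for w in archaic if w.strip()}
    return [w for w in words if w.strip().casefold() not in blocked]


print(remove_archaic(["Kodak", "thou", "table", "Forsooth"], ["thou", "forsooth"]))
# ['Kodak', 'table']
```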

submitted by /u/SheffieldParadox

Looking For A Long-term Collaborator – Data Engineer / Backend Engineer (Automotive Data)

We are building an automotive vehicle check platform focused on the European market and we are looking for a long-term technical collaborator, not a one-off freelancer.

Our goal is to collect, structure, and expose automotive-related data that can be included in vehicle history / verification reports.

We are particularly interested in sourcing and integrating:

Vehicle recalls / technical campaigns / service recalls, using public sources such as RAPEX (EU Safety Gate)
Commercial use status (e.g. taxi, ride-hailing, fleet usage), where this can be inferred from public or correlatable data
Safety ratings, especially Euro NCAP (free source)
Any other publicly available or correlatable automotive data that adds real value to a vehicle check report

What we are looking for:

Experience with data extraction, web scraping, or data engineering
Ability to deliver structured data (JSON / database) and ideally expose it via API
Focus on data quality, reliability, and long-term maintainability
Interest in a long-term collaboration, not short-term gigs

Context:

European market focus
Product-oriented project with real-world usage

If this sounds interesting, feel free to comment or send a DM with a short intro and relevant experience.

submitted by /u/cauchyez

What Packaging And Terms Make A Dataset Truly “enterprise-friendly”?

I am trying to define what makes a dataset “enterprise-ready” versus just a dump of files. Regarding structure, do you generally prefer one monolithic archive or segmented collections with manifests? I’m also looking for best practices on taxonomy. How do you expect keywords and tags to be formatted for the easiest integration into your systems?

One of the biggest friction points seems to be legal clarity. What is the clearest way to express restrictions, such as allowed uses, no redistribution, or retention limits, so that engineers can understand them without needing a lawyer to parse the file every time?

If you have seen examples of “gold standard” dataset documentation that handles this perfectly, I would love to see them.

Thanks again guys for the help!

submitted by /u/Lost_Transportation1

For Large Web‑scraped Datasets In 2025 – Are You Team Pandas Or Polars?

submitted by /u/Warm_Talk3385

Update To This: In The Google Drive There Are Currently Two Csv Files In The Top Folder. One Is The Raw Dataset. The Other Is A Dataset That Has Been Deduplicated. Right Now, I Am Running A Script That Tries To Repair The OCR Noise And Mistakes. That Will Also Be Uploaded As A Unique Dataset.

submitted by /u/Ok-District-1330

Looking For Dataset For AI Interview / Behavioral Analysis (Johari Window)

Hi, I’m working on a university project building an AI-based interview system (technical + HR). I’m specifically looking for datasets related to interview questions, interview responses, or behavioral/self-awareness analysis that could be mapped to concepts like the Johari Window (Open/Blind/Hidden/Unknown).

Most public datasets I’ve found focus only on question generation, not behavioral or self-awareness labeling.
If anyone knows of relevant datasets, research papers, or even similar projects, I’d really appreciate pointers.

Thanks!

submitted by /u/Connect_Length6153

ScrapeGraphAI 100k: 100,000 Real-World Structured LLM Output Examples From Production Usage


Announcing ScrapeGraphAI 100k – a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:

https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k

What’s Inside:

This is raw production data – not synthetic, not toy problems. Derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.

Every example includes:

– `prompt`: Actual user instructions sent to the LLM

– `schema`: JSON schema defining expected output structure

– `response`: What the LLM actually returned

– `content`: Source web content (markdown)

– `llm_model`: Which model was used (89% gpt-4o-mini)

– `source`: Source URL

– `execution_time`: Real timing data

– `response_is_valid`: Ground truth validation (avg 93% valid)

Schema Complexity Metrics:

– `schema_depth`: Nesting levels (typically 2-4, max ~7)

– `schema_keys`: Number of fields (typically 5-15, max 40+)

– `schema_elements`: Total structural pieces

– `schema_cyclomatic_complexity`: Branching complexity from `oneOf`, `anyOf`, etc.

– `schema_complexity_score`: Weighted aggregate difficulty metric

All metrics based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1)

Data Quality:

– Heavily balanced: Cleaned from 9M raw events to 100k diverse examples

– Real-world distribution: Includes simple extractions and gnarly complex schemas

– Validation annotations: `response_is_valid` field tells you when LLMs fail

– Complexity correlation: More complex schemas = lower validation rates (thresholds identified)

Key Findings:

– 93% average validation rate across all schemas

– Complex schemas cause noticeable degradation (non-linear drop-off)

– Response size heavily correlates with execution time

– 90% of schemas have <20 keys and depth <5

– Top 10% contain the truly difficult extraction tasks

Use Cases:

– Fine-tuning models for structured data extraction

– Analyzing LLM failure patterns on complex schemas

– Understanding real-world schema complexity distribution

– Benchmarking extraction accuracy and speed

– Training models that handle edge cases better

– Studying correlation between schema complexity and output validity

The Real Story:

This dataset reflects actual open-source usage patterns – not pre-filtered or curated. You see the mess:

– Schema duplication (some schemas used millions of times)

– Diverse complexity levels (from simple price extraction to full articles)

– Real failure cases (7% of responses don’t match their schemas)

– Validation is syntactic only (semantically wrong but valid JSON passes)

Load It:

from datasets import load_dataset
dataset = load_dataset("scrapegraphai/scrapegraphai-100k")
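
To look at the complexity-versus-validity relationship mentioned above, something like the following sketch could work on the loaded records. It assumes each record behaves like a dict with an integer `schema_depth` and a boolean `response_is_valid`, as the field list suggests; `validity_by_depth` is a hypothetical helper, and the sample records are made up.

```python
from collections import defaultdict


def validity_by_depth(records):
    """Map each schema depth to the share of responses marked valid."""
    totals = defaultdict(lambda: [0, 0])  # depth -> [valid count, total count]
    for r in records:
        bucket = totals[r["schema_depth"]]
        bucket[0] += 1 if r["response_is_valid"] else 0
        bucket[1] += 1
    return {depth: valid / total for depth, (valid, total) in sorted(totals.items())}


sample = [
    {"schema_depth": 2, "response_is_valid": True},
    {"schema_depth": 2, "response_is_valid": True},
    {"schema_depth": 5, "response_is_valid": True},
    {"schema_depth": 5, "response_is_valid": False},
]
print(validity_by_depth(sample))
# {2: 1.0, 5: 0.5}
```

With the real data, the same function could be pointed at a loaded split (for example `dataset["train"]`, if that is the split name) instead of `sample`.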

This is the kind of dataset that’s actually useful for ML work – messy, real, and representative of actual problems people solve.

submitted by /u/Electrical-Signal858

Backing Up Spotify

submitted by /u/hypd09

Football (Soccer) Data – Players (without Game Analysis)

Hi,

Looking for a dataset / API that contains information about football players: their nationalities, the clubs they played at, their coaches, and their individual and team trophies.

Most of the APIs and datasets out there are oriented either toward tactical game analysis or toward the transfer market, so I could not find a reliable data source.

I tried Transfermarkt data, but it has a lot of inaccuracies and limited history. I need something more comprehensive.

Any tips?

submitted by /u/orm_the_stalker

Looking To Make Video Game Datasets By Reading Game Memory.

I have been trying to find a way to get into the Fortnite kernel so that I can record myself playing and have the automatic annotations, hopefully, as well as the perfect character representation from reading the memory.

Any tips to get around Easy Anti-Cheat? No injection, just reading.

submitted by /u/Crazy_Armadillo_8976

Identifying High Growth Github Repositories

I’m trying to identify the repositories that are growing the fastest on GitHub and came across gharchive.org. Has anyone used it before, or is there a better solution?
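
For anyone wanting to try this, a minimal sketch of one approach: count star events per repository in two time windows and rank by the difference. It assumes the archive files are newline-delimited JSON in which each event has a `type` and a `repo.name`, and that a `WatchEvent` corresponds to a star. The function names are hypothetical.

```python
import json
from collections import Counter


def count_stars(lines):
    """Count WatchEvent (star) events per repository from JSON lines."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "WatchEvent":
            counts[event["repo"]["name"]] += 1
    return counts


def fastest_growing(earlier, later, top=10):
    """Rank repositories by how many more stars they gained in the later window."""
    growth = {repo: later[repo] - earlier.get(repo, 0) for repo in later}
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:top]


star = lambda name: json.dumps({"type": "WatchEvent", "repo": {"name": name}})
week1 = count_stars([star("a/x")])
week2 = count_stars([star("a/x"), star("a/x"), star("a/x"), star("b/y")])
print(fastest_growing(week1, week2))
# [('a/x', 2), ('b/y', 1)]
```

With downloaded archives, each hourly file could be opened with `gzip.open(path, "rt", encoding="utf-8")` and passed straight to `count_stars`, then the hourly counters summed per window.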

submitted by /u/-Zubzii-

I’m Trying To “Moneyball” US High Schools To See Which Ones Are Actually D1 Athlete Factories. Is There A Clean Dataset For This?

I’ve gone down a rabbit hole trying to analyze the “Athlete ROI” of different zip codes. Basically, I want to build a heatmap that shows which high schools are statistically over-performing at sending kids to college on athletic scholarships (specifically D1/D2 commits).

My theory is that there are “hidden gem” public schools that produce just as many elite athletes as the $50k/year private academies, but the data is impossible to visualize because it’s all locked in individual profiles. I’ve looked at MaxPreps, 247Sports, and Rivals, but they are designed for tracking single players, not analyzing school output at scale.

The Question: Does anyone know of an aggregate dataset (or a paid API) that links:

High School Name / Zip Code
Total Commits per year (broken down by D1 vs D2 if possible)
Sport Category

I’m trying to avoid writing a scraper to crawl 20,000 school pages if a clean database already exists. Has anyone worked with recruitment data like this before?

submitted by /u/Dry-Town7979
