Made My First Dataset! Ca. 100 Scanned Pages Of Books From 1910-1920, Serbian Cyrillic. Kaggle And HF

Hi everyone, first time building a dataset. This is v0.1: about 100 scans of book pages (some scans show a single page, others a two-page spread). The books are in the public domain. The intended use is for anyone working on image-to-text (OCR) software.

The scans are in .jpg format, and a PDF of the whole collection is included.

I have also included 2 .txt files:

1) A “raw” (i.e. not corrected for hallucinations, artifacts, etc.) .txt file for anyone looking to do a check. The file is in Markdown.

2) A “corrected” .txt file, where the hallucinations, artifacts, errors, etc. were manually corrected. This file is in .txt, not Markdown.
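For anyone who wants to do that check, here is a minimal sketch of one way to compare the two text files: a character error rate (CER), i.e. the edit distance between the raw and corrected text divided by the length of the corrected text. The function names and the file paths in the usage note are placeholders I made up, not the actual names in the dataset. It is plain Python, so it is only practical on a page or a paragraph at a time; running it on the whole collection in one go would be very slow. Also note that the raw file is in Markdown, so any markup in it would be counted as errors unless it is stripped first.

```python
from pathlib import Path


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER of `hypothesis` (raw OCR) measured against `reference` (corrected)."""
    if not reference:
        raise ValueError("reference text is empty")
    return levenshtein(hypothesis, reference) / len(reference)


def cer_from_files(raw_path: str, corrected_path: str) -> float:
    """Read two UTF-8 text files and return the CER between them.

    Both paths are hypothetical; point them at per-page excerpts,
    not at the full files, to keep the runtime reasonable.
    """
    raw = Path(raw_path).read_text(encoding="utf-8")
    corrected = Path(corrected_path).read_text(encoding="utf-8")
    return character_error_rate(raw, corrected)
```

Usage would look something like `cer_from_files("page_001_raw.txt", "page_001_corrected.txt")`, assuming you have split the text into per-page files yourself.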

Looking for feedback if this is useful, how to make a dataset like this better, etc.

Kaggle: https://www.kaggle.com/datasets/booksofjeremiah/serbian-cyrillic-script-printed

Hugging Face: https://huggingface.co/datasets/Books-of-Jeremiah/raw-OCR-serbian-cyrillic

Any feedback on whether the set is useful for other use cases or how it can be made better is appreciated!

submitted by /u/Books_Of_Jeremiah