Hello 🤗
I want to build a dataset of manipulated documents, pairing each original document with its modified version, because I'm working on a model to localize those forgeries 🧐 The available public datasets are not sufficient, but I believe it is possible to build one without resorting to synthetic data.

On the French gazette website, organizations and funds are required to upload their financial reports every year, and these are publicly available. If they make a mistake, the wrong document is left on the website for a while and a rectified document has to be uploaded. Now if the two versions match pixel for pixel everywhere except for a tiny portion, then the document has only been modified digitally and not rescanned. I have been able to find a few pairs of documents like that, but not nearly enough to train a model.

Do you know any websites that work the same way? Where people upload PDFs, and these PDFs are sometimes rectified while both versions remain online? Preferably free-form PDFs rather than a fixed form like the US gazette.
Thank you for your help!
submitted by /u/VegetableMistake5007