20,000 Epstein Files In A Single Text File Available To Download (~100 MB)

I’ve processed all the text and image files (~25,000 document pages/emails) from the individual folders released last Friday into a single two-column text file. I used Google’s Tesseract OCR library to convert the JPG images to text.

You can download it here: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K

For each document, I’ve included the full path to the original Google Drive folder from the House Oversight Committee so you can link back and verify the contents. In using this dataset, please be sensitive to the privacy of the people involved (and remember that many of these people were certainly not involved in any of the actions which precipitated the investigation).
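
For anyone who wants to search the file after downloading it, here is a minimal sketch using only the Python standard library. It assumes the file is a CSV whose first column is the source path and whose second column is the extracted text, with a header row. Check the dataset page for the actual layout before relying on it. The function names `iter_documents` and `search` and the sample rows are hypothetical, made up for illustration.

```python
import csv
import io


def iter_documents(handle, has_header=True):
    """Yield (source_path, text) pairs from a two-column CSV stream.

    Assumption: column 1 is the path, column 2 is the document text.
    """
    # OCR'd documents can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed rows rather than guessing
        yield row[0], row[1]


def search(handle, keyword):
    """Return the source paths of documents whose text contains keyword."""
    keyword = keyword.lower()
    return [path for path, text in iter_documents(handle) if keyword in text.lower()]


# Tiny in-memory stand-in for the real file (placeholder rows, not real data).
sample = io.StringIO(
    "path,text\n"
    'example_folder/doc_001.txt,"First line\nsecond line about a Meeting"\n'
    "example_folder/doc_002.jpg,Unrelated text\n"
)
print(search(sample, "meeting"))  # ['example_folder/doc_001.txt']
```

With the real file you would pass `open(filename, newline="", encoding="utf-8")` instead of the in-memory sample. Because each hit comes back as a path, you can go to the original folder and check the OCR output against the source document.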

submitted by /u/tensonaut