{"id":36642,"date":"2025-11-18T20:27:14","date_gmt":"2025-11-18T19:27:14","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/cleaned-structured-the-nov-2025-epstein-email-dump-into-a-single-jsonl-9966-entries-semantic-explorer-huggingface\/"},"modified":"2025-11-18T20:27:14","modified_gmt":"2025-11-18T19:27:14","slug":"cleaned-structured-the-nov-2025-epstein-email-dump-into-a-single-jsonl-9966-entries-semantic-explorer-huggingface","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/cleaned-structured-the-nov-2025-epstein-email-dump-into-a-single-jsonl-9966-entries-semantic-explorer-huggingface\/","title":{"rendered":"Cleaned + Structured The Nov 2025 Epstein Email Dump Into A Single JSONL (9966 Entries) + Semantic Explorer [HuggingFace]"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>A few days after the Nov 12th 2025 Epstein email dump went public, I pulled all the individual text files together, cleaned them, removed duplicates, and converted everything into a single standardized .jsonl dataset.<\/p>\n<p>No PDFs, no images \u2014 this is text-only. The raw dump wasn\u2019t structured: filenames were random, topics weren\u2019t grouped, and keyword search barely worked. 
Names weren\u2019t consistent, related passages didn\u2019t use the same vocabulary, and there was no way to browse by theme.<\/p>\n<p>So I built a structured version:<\/p>\n<pre><code>merged everything into one JSONL file\neach line = one JSON object (9966 total entries)\ncleaned formatting + removed noise\nchunked text properly\ngrouped the dataset into clusters (topic-based)\nadded BM25 keyword search\nadded simple topic-term extraction\nadded entity search\nmade a lightweight explorer UI on HuggingFace\n<\/code><\/pre>\n<p>\ud83d\udd17 HuggingFace explorer + dataset:<\/p>\n<p><a href=\"https:\/\/huggingface.co\/spaces\/cjc0013\/epstein-semantic-explorer\">https:\/\/huggingface.co\/spaces\/cjc0013\/epstein-semantic-explorer<\/a><\/p>\n<p>JSONL structure (one entry per line):<\/p>\n<pre><code>{&quot;id&quot;: 123, &quot;cluster&quot;: 47, &quot;text&quot;: &quot;&#8230;&quot;}\n<\/code><\/pre>\n<p>What you can do in the explorer:<\/p>\n<pre><code>Browse clusters by topic\nRun BM25 keyword search\nSearch entities (names\/places\/orgs)\nView cluster summaries\nSee top terms\nUpload your own JSONL to reuse the explorer for any dataset\n<\/code><\/pre>\n<p>This is not commentary \u2014 just a structured dataset + tools for anyone who wants to analyze the dump more efficiently.<\/p>\n<p>Please let me know if you encounter any errors. 
I\u2019ll answer any questions about the dataset\u2019s construction.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Either_Pound1986\"> \/u\/Either_Pound1986 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1p0ktjz\/cleaned_structured_the_nov_2025_epstein_email\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1p0ktjz\/cleaned_structured_the_nov_2025_epstein_email\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>A few days after the Nov 12th 2025 Epstein email dump went public, I pulled all 
the&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-36642","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36642","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=36642"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36642\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=36642"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=36642"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=36642"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}