{"id":38545,"date":"2026-02-01T05:27:04","date_gmt":"2026-02-01T04:27:04","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/zero-touch-pipeline-explorer-for-a-subset-of-the-epstein-related-doj-pdf-release-hashed-restart-safe-source-path-traceable\/"},"modified":"2026-02-01T05:27:04","modified_gmt":"2026-02-01T04:27:04","slug":"zero-touch-pipeline-explorer-for-a-subset-of-the-epstein-related-doj-pdf-release-hashed-restart-safe-source-path-traceable","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/zero-touch-pipeline-explorer-for-a-subset-of-the-epstein-related-doj-pdf-release-hashed-restart-safe-source-path-traceable\/","title":{"rendered":"Zero-touch Pipeline + Explorer For A Subset Of The Epstein-related DOJ PDF Release (hashed, Restart-safe, Source-path Traceable)"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I ran an end-to-end preprocess on a <strong>subset<\/strong> of the Epstein-related files from the DOJ PDF release I downloaded (not claiming completeness). 
The goal is <strong>corpus exploration + provenance<\/strong>, not \u201ctruth,\u201d and not perfect extraction.<\/p>\n<p>Explorer: <a href=\"https:\/\/huggingface.co\/spaces\/cjc0013\/epstein-corpus-explorer\">https:\/\/huggingface.co\/spaces\/cjc0013\/epstein-corpus-explorer<\/a><\/p>\n<p>Raw dataset artifacts (so you can validate \/ build your own tooling): <a href=\"https:\/\/huggingface.co\/datasets\/cjc0013\/epsteindataset\/tree\/main\">https:\/\/huggingface.co\/datasets\/cjc0013\/epsteindataset\/tree\/main<\/a><\/p>\n<hr \/>\n<h2>What I did<\/h2>\n<h3>1) Ingest + hashing (deterministic identity)<\/h3>\n<ul>\n<li>Input: <code>\/content\/TEXT<\/code> (directory)<\/li>\n<li>Files hashed: <strong>331,655<\/strong><\/li>\n<li>Everything is hashed so runs have a stable identity and you can detect changes.<\/li>\n<li>Every chunk includes a <code>source_file<\/code> path so you can map a chunk back to the exact file you downloaded (i.e., your local DOJ dump on disk). This is for auditability.<\/li>\n<\/ul>\n<h3>2) Text extraction from PDFs (NO OCR)<\/h3>\n<p>I <strong>did not run OCR<\/strong>.<\/p>\n<p>Reason: the PDFs had <strong>selectable\/highlightable text<\/strong>, so there\u2019s already a text layer. OCR would mostly add noise.<\/p>\n<p>Caveat: extraction still isn\u2019t perfect because <strong>redactions can disrupt the PDF text layer<\/strong>, even when text is highlightable. So you may see:<\/p>\n<ul>\n<li>missing spans<\/li>\n<li>duplicated fragments<\/li>\n<li>out-of-order text<\/li>\n<li>odd tokens where redaction overlays cut across lines<\/li>\n<\/ul>\n<p>I kept extraction as close to \u201cnormal\u201d as possible (no reconstruction \/ no guessing redacted content). 
This is meant for <strong>exploration<\/strong>, not as an authoritative transcript.<\/p>\n<h3>3) Chunking<\/h3>\n<ul>\n<li>Output chunks: <strong>489,734<\/strong><\/li>\n<li>Stored with stable IDs + ordering + source path provenance.<\/li>\n<\/ul>\n<h3>4) Embeddings<\/h3>\n<ul>\n<li>Model: <code>BAAI\/bge-large-en-v1.5<\/code><\/li>\n<li><code>embeddings.npy<\/code> shape <strong>(489,734, 1024)<\/strong> float32<\/li>\n<\/ul>\n<h3>5) BM25 artifacts<\/h3>\n<ul>\n<li><code>bm25_stats.parquet<\/code><\/li>\n<li><code>bm25_vocab.parquet<\/code><\/li>\n<li>Full BM25 index object skipped at this scale (chunk_count &gt; 50k), but vocab\/stats are written.<\/li>\n<\/ul>\n<h3>6) Clustering (scale-aware)<\/h3>\n<p>HDBSCAN at ~490k points can take a very long time and is largely CPU-bound, so at large N the pipeline auto-switches to:<\/p>\n<ul>\n<li>PCA \u2192 64 dims<\/li>\n<li>MiniBatchKMeans<\/li>\n<\/ul>\n<p>This completed cleanly.<\/p>\n<h3>7) Restart-safe \/ resume<\/h3>\n<p>If the runtime dies or I stop it, rerunning reuses valid artifacts (chunks\/BM25\/embeddings) instead of redoing multi-hour work.<\/p>\n<hr \/>\n<h2>Outputs produced<\/h2>\n<ul>\n<li><code>chunks.parquet<\/code> (chunk_id, order_index, doc_id, source_file, text)<\/li>\n<li><code>embeddings.npy<\/code><\/li>\n<li><code>cluster_labels.parquet<\/code> (chunk_id, cluster_id, cluster_prob)<\/li>\n<li><code>bm25_stats.parquet<\/code><\/li>\n<li><code>bm25_vocab.parquet<\/code><\/li>\n<li><code>fused_chunks.jsonl<\/code><\/li>\n<li><code>preprocess_report.json<\/code><\/li>\n<\/ul>\n<hr \/>\n<h2>Quick note on \u201cquality\u201d \/ bugs<\/h2>\n<p>I\u2019m not a data scientist and I\u2019m not claiming this is bug-free \u2014 including the Hugging Face explorer itself. 
That\u2019s why I\u2019m also publishing the <strong>raw artifacts<\/strong> so anyone can audit the pipeline outputs, rebuild the index, or run their own analysis from scratch: <a href=\"https:\/\/huggingface.co\/datasets\/cjc0013\/epsteindataset\/tree\/main\">https:\/\/huggingface.co\/datasets\/cjc0013\/epsteindataset\/tree\/main<\/a><\/p>\n<hr \/>\n<h2>What this is \/ isn\u2019t<\/h2>\n<ul>\n<li>Not claiming perfect extraction (redactions can corrupt the text layer even without OCR).<\/li>\n<li>Not claiming completeness (subset only).<\/li>\n<li><strong>Is<\/strong> deterministic + hashed + traceable back to source file locations for auditing.<\/li>\n<\/ul><\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Either_Pound1986\"> \/u\/Either_Pound1986 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1qsoffb\/zerotouch_pipeline_explorer_for_a_subset_of_the\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1qsoffb\/zerotouch_pipeline_explorer_for_a_subset_of_the\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>I ran an end-to-end preprocess on a subset of the Epstein-related files from the DOJ PDF 
release&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-38545","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/38545","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=38545"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/38545\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=38545"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=38545"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=38545"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}