Zero-Touch Pipeline + Explorer for a Subset of the Epstein-Related DOJ PDF Release (Hashed, Restart-Safe, Source-Path Traceable)

I ran an end-to-end preprocessing pass on a subset of the Epstein-related files from the DOJ PDF release I downloaded (not claiming completeness). The goal is corpus exploration and provenance, not “truth,” and not perfect extraction.

Explorer: https://huggingface.co/spaces/cjc0013/epstein-corpus-explorer

Raw dataset artifacts (so you can validate / build your own tooling): https://huggingface.co/datasets/cjc0013/epsteindataset/tree/main


What I did

1) Ingest + hashing (deterministic identity)

  • Input: /content/TEXT (directory)
  • Files hashed: 331,655
  • Everything is hashed so runs have a stable identity and you can detect changes.
  • Every chunk includes a source_file path so you can map a chunk back to the exact file you downloaded (i.e., your local DOJ dump on disk). This is for auditability.
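
For reference, a minimal sketch of what this kind of per-file hashing can look like (SHA-256 and the streaming block size are illustrative assumptions, not necessarily what the pipeline uses):

```python
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """Stream the file through SHA-256 so large PDFs never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # 1 MiB blocks
            h.update(block)
    return h.hexdigest()

# Sorted traversal keeps the manifest deterministic across runs,
# so any changed/added/removed file shows up as a manifest diff.
root = Path("/content/TEXT")
manifest = {str(p): hash_file(p) for p in sorted(root.rglob("*")) if p.is_file()}
```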

2) Text extraction from PDFs (NO OCR)

I did not run OCR.

Reason: the PDFs had selectable/highlightable text, so there’s already a text layer. OCR would mostly add noise.

Caveat: extraction still isn’t perfect because redactions can disrupt the PDF text layer, even when text is highlightable. So you may see:

  • missing spans
  • duplicated fragments
  • out-of-order text
  • odd tokens where redaction overlays cut across lines

I kept extraction as close to “normal” as possible (no reconstruction / no guessing redacted content). This is meant for exploration, not as an authoritative transcript.
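
A text-layer-only pass (no OCR) can be as simple as this sketch with pypdf (the library choice here is illustrative, not necessarily what the pipeline uses):

```python
from pypdf import PdfReader

def extract_text_layer(pdf_path: str) -> list[str]:
    """Read only the embedded text layer, page by page.
    No OCR, no reconstruction -- redaction overlays can still
    leave gaps, duplicates, or out-of-order spans."""
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```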

3) Chunking

  • Output chunks: 489,734
  • Stored with stable IDs + ordering + source path provenance.
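
A minimal sketch of chunking with stable, content-derived IDs matching the chunks.parquet schema (window size and overlap are illustrative values, not the pipeline's actual settings):

```python
import hashlib

def chunk_doc(doc_id: str, source_file: str, text: str,
              size: int = 1000, overlap: int = 200) -> list[dict]:
    """Fixed-size character windows; chunk IDs are derived from
    (doc_id, order_index) so reruns produce identical IDs."""
    rows = []
    step = size - overlap
    for order_index, start in enumerate(range(0, max(len(text), 1), step)):
        body = text[start:start + size]
        if not body:
            break
        chunk_id = hashlib.sha256(f"{doc_id}:{order_index}".encode()).hexdigest()[:16]
        rows.append({"chunk_id": chunk_id, "order_index": order_index,
                     "doc_id": doc_id, "source_file": source_file, "text": body})
    return rows
```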

4) Embeddings

  • Model: BAAI/bge-large-en-v1.5
  • embeddings.npy: shape (489,734, 1024), float32
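
Embedding with this model via sentence-transformers is straightforward; a sketch (batch size is an assumption, though bge models are typically used with normalized vectors):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_chunks(texts: list[str]) -> np.ndarray:
    # bge-large-en-v1.5 emits 1024-dim vectors; normalizing makes
    # cosine similarity a plain dot product downstream.
    return model.encode(texts, batch_size=64, normalize_embeddings=True,
                        show_progress_bar=True).astype(np.float32)

# np.save("embeddings.npy", embed_chunks(chunk_texts))  # -> (n_chunks, 1024) float32
```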

5) BM25 artifacts

  • bm25_stats.parquet
  • bm25_vocab.parquet
  • Full BM25 index object is skipped at this scale (chunk_count > 50k), but vocab/stats are still written.
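
The exact parquet schemas aren't spelled out above, but the idea of writing stats/vocab instead of pickling a full index can be sketched like this (column names are assumptions):

```python
import math
from collections import Counter
import pandas as pd

def write_bm25_artifacts(tokenized_chunks: list[list[str]]) -> None:
    """Persist per-term and per-chunk statistics sufficient to
    rebuild a BM25 index later, without a full in-memory index object."""
    n = len(tokenized_chunks)
    df = Counter()
    for toks in tokenized_chunks:
        df.update(set(toks))  # document frequency: one count per chunk
    terms = sorted(df)
    pd.DataFrame({
        "term": terms,
        "doc_freq": [df[t] for t in terms],
        "idf": [math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in terms],  # Okapi idf
    }).to_parquet("bm25_vocab.parquet")
    pd.DataFrame({
        "chunk_idx": range(n),
        "doc_len": [len(toks) for toks in tokenized_chunks],
    }).to_parquet("bm25_stats.parquet")
```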

6) Clustering (scale-aware)

HDBSCAN at ~490k points can take a very long time and is largely CPU-bound, so at large N the pipeline auto-switches to:

  • PCA → 64 dims
  • MiniBatchKMeans

This completed cleanly.
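
A sketch of that fallback path (the cluster count is a placeholder; only the PCA → 64 dims → MiniBatchKMeans shape is what the pipeline describes):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

emb = np.asarray(np.load("embeddings.npy", mmap_mode="r"))

# 1024 dims -> 64 dims first; MiniBatchKMeans then fits in streaming
# batches, which stays tractable at ~490k points where HDBSCAN would
# grind for hours on CPU.
reduced = PCA(n_components=64).fit_transform(emb)
labels = MiniBatchKMeans(n_clusters=200,   # placeholder k, not the pipeline's value
                         batch_size=4096,
                         n_init="auto").fit_predict(reduced)
```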

7) Restart-safe / resume

If the runtime dies or I stop it, rerunning reuses valid artifacts (chunks/BM25/embeddings) instead of redoing multi-hour work.
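
The resume logic can be as simple as checking for valid artifacts before each stage; a sketch (the stage functions are hypothetical placeholders, and a stricter check could verify hashes recorded in preprocess_report.json):

```python
from pathlib import Path

def stage_done(artifact: str) -> bool:
    """Treat a non-empty artifact on disk as a completed stage."""
    p = Path(artifact)
    return p.exists() and p.stat().st_size > 0

if not stage_done("chunks.parquet"):
    run_chunking()        # hypothetical stage function
if not stage_done("embeddings.npy"):
    run_embedding()       # hypothetical stage function
```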


Outputs produced

  • chunks.parquet (chunk_id, order_index, doc_id, source_file, text)
  • embeddings.npy
  • cluster_labels.parquet (chunk_id, cluster_id, cluster_prob)
  • bm25_stats.parquet
  • bm25_vocab.parquet
  • fused_chunks.jsonl
  • preprocess_report.json
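
Loading the artifacts for your own analysis needs nothing beyond pandas and numpy; a sketch (the dot-product lookup assumes the vectors are normalized):

```python
import numpy as np
import pandas as pd

chunks = pd.read_parquet("chunks.parquet")       # chunk_id, order_index, doc_id, source_file, text
emb = np.load("embeddings.npy", mmap_mode="r")   # (489734, 1024) float32
labels = pd.read_parquet("cluster_labels.parquet")

assert len(chunks) == emb.shape[0] == len(labels)

# Nearest neighbors of chunk 0 by dot product (== cosine if normalized).
scores = np.asarray(emb) @ np.asarray(emb[0])
top = np.argsort(scores)[::-1][:10]
print(chunks.iloc[top][["source_file", "text"]])
```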

Quick note on “quality” / bugs

I’m not a data scientist and I’m not claiming this is bug-free — including the Hugging Face explorer itself. That’s why I’m also publishing the raw artifacts so anyone can audit the pipeline outputs, rebuild the index, or run their own analysis from scratch: https://huggingface.co/datasets/cjc0013/epsteindataset/tree/main


What this is / isn’t

  • Not claiming perfect extraction (redactions can corrupt the text layer even without OCR).
  • Not claiming completeness (subset only).
  • Is deterministic + hashed + traceable back to source file locations for auditing.

