Shared this on r/MachineLearning a few days ago and got good discussion (30K views, 100+ upvotes) — figured this community would want to know about it too since it’s more directly relevant here.
I’ve spent the last several years building and processing a complete Usenet corpus and finally have it documented well enough to share properly.
What it is: A deduplicated, sanitized collection of Usenet posts from 1980 through 2013 — covering the full arc of Usenet from its academic origins through peak adoption to decline. Pre-web, pre-social media, pre-AI. Entirely human-generated.
Stats:
- 103.1 billion tokens (cl100k_base; quick counting snippet after these stats)
- 408,236,288 posts
- 18,347 newsgroups
- 9 top-level hierarchies: alt, rec, comp, soc, sci, misc, news, talk, humanities
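Quick aside on the token count: cl100k_base is the encoding shipped with OpenAI's tiktoken package, so a count can be reproduced on the sample files the same way (the string below is a placeholder):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode("post body text here"))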
Processing applied:
- alt.binaries.* excluded entirely at hierarchy level (UUencoded/base64 binary content)
- Adult content newsgroups excluded at hierarchy level
- Record-level: deduplication by Message-ID, binary detection and removal, PII redaction (email addresses replaced with an [email] token, Message-IDs SHA-256 hashed), sensitive content removal (see the sketch after this list)
- Language detection on every record (fasttext LID-176) — 96.6% English, 100+ languages total
- Format: gzip-compressed JSONL, ~141GB compressed
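For anyone curious what the record-level pass looks like in practice, here is a simplified sketch. It's illustrative only, not the production pipeline: the raw message_id field name, the email regex, and the in-memory dedup set are placeholders, and it assumes the fasttext Python package plus the public lid.176.bin model.

    import hashlib
    import re

    import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    LID_MODEL = fasttext.load_model("lid.176.bin")  # the LID-176 model

    seen = set()  # Message-ID dedup; at 408M posts you'd shard or go on-disk

    def process(record):
        """One record-level pass: dedup, email redaction, ID hashing, language ID."""
        msg_id = record["message_id"]  # placeholder field name for the raw header
        if msg_id in seen:             # drop exact duplicates by Message-ID
            return None
        seen.add(msg_id)

        text = EMAIL_RE.sub("[email]", record["text"])  # redact email addresses

        # Keep only a SHA-256 hash of the Message-ID (matches the "id" field below)
        hashed_id = "msg-" + hashlib.sha256(msg_id.encode("utf-8")).hexdigest()

        # fastText's predict() rejects newlines, so flatten the text first;
        # labels come back as e.g. "__label__en"
        labels, _probs = LID_MODEL.predict(text.replace("\n", " "))
        lang = labels[0].replace("__label__", "")

        return {**record, "text": text, "id": hashed_id, "lang": lang}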
Schema:
{ "text": "post body", "group": "comp.lang.python", "date": "1995-03-14", "subject": "Re: thread subject", "author": "Display Name", "id": "msg-<sha256hex>" }
Samples: 11 sample files (5K posts per hierarchy + combined sets) are freely available — no approval needed. Full corpus available for licensing.
Dataset has also been added to the AI datasets directory at lifearchitect.ai/datasets-table.
Link in comments.