[OC] Usenet Corpus 1980–2013 — 103B Tokens, 408M Posts, 9 Hierarchies, Fully Processed

Shared this on r/MachineLearning a few days ago and got good discussion (30K views, 100+ upvotes) — figured this community would want to know about it too since it’s more directly relevant here.

I’ve spent the last several years building and processing a complete Usenet corpus and finally have it documented well enough to share properly.

What it is: A deduplicated, sanitized collection of Usenet posts from 1980 through 2013 — covering the full arc of Usenet from its academic origins through peak adoption to decline. Much of it predates the web and social media, and all of it predates generative AI: entirely human-generated text.

Stats:

  • 103.1 billion tokens (cl100k_base)
  • 408,236,288 posts
  • 18,347 newsgroups
  • 9 top-level hierarchies: alt, rec, comp, soc, sci, misc, news, talk, humanities

Processing applied:

  • alt.binaries.* excluded entirely at hierarchy level (UUencoded/base64 binary content)
  • Adult content newsgroups excluded at hierarchy level
  • Record-level: deduplication by Message-ID, binary detection and removal, PII redaction (email addresses replaced with [email] token, Message-IDs SHA-256 hashed), sensitive content removal
  • Language detection on every record (fastText LID-176) — 96.6% English, 100+ languages total
  • Format: gzip-compressed JSONL, ~141GB compressed
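The record-level steps above (dedup by Message-ID, email redaction, SHA-256 hashing) can be sketched roughly like this. This is a minimal illustration with hypothetical helper names and a deliberately simple email regex — the actual pipeline's rules aren't published:

```python
import hashlib
import re

# Simplistic email pattern for illustration only; a production
# pipeline would need a far more careful matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    """Replace every email address with the [email] token."""
    return EMAIL_RE.sub("[email]", text)

def hash_message_id(message_id: str) -> str:
    """Derive a stable, non-reversible 'id' field from a raw Message-ID."""
    digest = hashlib.sha256(message_id.encode("utf-8")).hexdigest()
    return f"msg-{digest}"

def dedupe(records):
    """Keep only the first record seen for each Message-ID."""
    seen = set()
    for rec in records:
        if rec["message_id"] not in seen:
            seen.add(rec["message_id"])
            yield rec
```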

Schema:

{ "text": "post body", "group": "comp.lang.python", "date": "1995-03-14", "subject": "Re: thread subject", "author": "Display Name", "id": "msg-<sha256hex>" } 

Samples: 11 sample files (5K posts per hierarchy + combined sets) are freely available — no approval needed. Full corpus available for licensing.

Dataset has also been added to the AI datasets directory at lifearchitect.ai/datasets-table.

Link in comments.

submitted by /u/OwnerByDane
