Built this as part of a multilingual pretraining research project. Figured I’d share it here.
European HPLT v1 — quality-filtered from HPLT v3 web crawl data:
45M documents across 41 European languages (Germanic, Romance, Slavic, Celtic, Baltic, Finno-Ugric + more)
~50.9B estimated tokens, ~190 GB raw JSONL
Every doc has a WDS quality score of 10 or higher; exact duplicates removed via SHA-256 hashing
Per-document metadata: language, URL, quality score, register/genre tag, char/word count
CC0 1.0 license — fully open, inherited from HPLT v3
Covers lower-resource languages (Maltese, Faroese, Scottish Gaelic, Occitan, Luxembourgish, Irish, Asturian) that are underrepresented in OSCAR and CulturaX.
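
For anyone curious what a score threshold plus exact SHA-256 dedup looks like in practice, here is a rough sketch. It is not the actual pipeline used to build the dataset. The field names `text`, `quality_score` and `language`, the language codes, and the choice to hash the raw document text are all assumptions made for illustration; check the dataset card for the real schema.

```python
import hashlib
import json


def filter_and_dedup(lines, min_score=10):
    """Yield JSONL records scoring >= min_score, skipping exact text duplicates.

    The field names "text" and "quality_score" are hypothetical.
    """
    seen = set()  # SHA-256 hex digests of texts already emitted
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        if doc["quality_score"] < min_score:
            continue
        # Assumption: the hash covers the raw document text, byte for byte.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc


# Tiny made-up sample, not real dataset rows.
sample = [
    json.dumps({"text": "Bonjour le monde", "quality_score": 12, "language": "fr"}),
    json.dumps({"text": "Bonjour le monde", "quality_score": 12, "language": "fr"}),
    json.dumps({"text": "Dia duit", "quality_score": 7, "language": "ga"}),
    json.dumps({"text": "Moien Welt", "quality_score": 10, "language": "lb"}),
]

kept = list(filter_and_dedup(sample))
for doc in kept:
    print(doc["language"], doc["quality_score"], doc["text"])
```

Exact hashing only catches byte-identical texts; near-duplicates (boilerplate variations, whitespace differences) would need something like MinHash on top.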
HuggingFace: huggingface.co/datasets/ashtok897/european-hplt-v1
submitted by /u/ashtok897