I’ve released two large-scale multilingual pretraining datasets on Hugging Face, built from the HPLT v3 high-quality web crawl. Both are CC0-licensed (public domain) and ready to use with 🤗 Datasets.
📦 Indic HPLT v1
~9.8M documents | ~8.4B estimated tokens | 11 languages
🔗 https://huggingface.co/datasets/AM0908/indic-hplt-v1
Covers: Hindi, Bengali, Punjabi, Urdu, Tamil, Telugu, Marathi, Gujarati, Malayalam, Kannada, English
📦 Indic HPLT v2 (larger successor)
~34.6M documents | ~25.5B estimated tokens | 14 languages
🔗 https://huggingface.co/datasets/AM0908/indic-hplt-v2
Adds Nepali, Odia, and Assamese on top of v1, with ~3.5× as many documents overall.
🔧 How it was built
- Source: HPLT v3 sorted shards (top-scoring documents by WDS quality score)
- Quality filters: 50–100K chars/doc, at most 50% non-alphabetic chars, average word length of at least 2.0
- Deduplication: SHA-256 exact dedup on all languages + MinHash LSH near-dedup on English (Jaccard ≥ 0.7)
- Pipeline code: https://github.com/ashtok/multilingual-hplt-corpus
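
For anyone who wants a feel for the filtering and exact-dedup steps listed above, here is a minimal Python sketch. It is not the code from the linked repo, and several details are my assumptions: that "50–100K chars/doc" means 50 to 100,000 characters, that whitespace is ignored when computing the non-alphabetic ratio, that combining marks (Indic vowel signs) count as alphabetic, and that hashing is done on the raw text. The `jaccard` helper computes exact Jaccard similarity on word 3-gram shingles (shingle size is also an assumption); MinHash LSH approximates this same quantity at scale, so this only illustrates what the 0.7 threshold measures. All function names are hypothetical.

```python
import hashlib
import unicodedata

MIN_CHARS, MAX_CHARS = 50, 100_000  # assumed reading of "50–100K chars/doc"
MAX_NON_ALPHA_RATIO = 0.5
MIN_AVG_WORD_LEN = 2.0
JACCARD_THRESHOLD = 0.7


def is_alphabetic(ch: str) -> bool:
    # Letters (L*) plus combining marks (M*), so Indic vowel signs are not
    # counted as "non-alphabetic" (assumption, str.isalpha() rejects them).
    return unicodedata.category(ch)[0] in ("L", "M")


def passes_quality_filters(text: str) -> bool:
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        return False
    words = text.split()
    if not words:
        return False
    # Ratio is computed over non-whitespace characters (assumption).
    visible = [ch for ch in text if not ch.isspace()]
    non_alpha = sum(1 for ch in visible if not is_alphabetic(ch))
    if non_alpha / len(visible) > MAX_NON_ALPHA_RATIO:
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    return avg_word_len >= MIN_AVG_WORD_LEN


def exact_dedup(docs):
    # Keep the first occurrence of each document, keyed by SHA-256 of its text.
    seen, unique = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def shingles(text: str, k: int = 3) -> set:
    # Set of lowercase word k-grams; short texts yield a single shorter tuple.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    # Exact Jaccard similarity of the two shingle sets.
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Usage would look like `kept = exact_dedup(d for d in docs if passes_quality_filters(d))`, and two English documents would be treated as near-duplicates when `jaccard(a, b) >= JACCARD_THRESHOLD`. Comparing every pair this way is quadratic, which is why an LSH index is used on a real corpus.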
submitted by /u/ashtok897