[Dataset] Indic HPLT v1 & v2: Large-scale Multilingual Pretraining Corpora for 14 Languages (13 Indic + English), CC0

I’ve released two large-scale multilingual pretraining datasets on Hugging Face, both built from the HPLT v3 high-quality web crawl. Both are CC0-licensed (public domain) and ready to use with 🤗 Datasets.

📦 Indic HPLT v1

~9.8M documents | ~8.4B estimated tokens | 11 languages
🔗 https://huggingface.co/datasets/AM0908/indic-hplt-v1

Covers: Hindi, Bengali, Punjabi, Urdu, Tamil, Telugu, Marathi, Gujarati, Malayalam, Kannada, English

📦 Indic HPLT v2 (larger successor)

~34.6M documents | ~25.5B estimated tokens | 14 languages
🔗 https://huggingface.co/datasets/AM0908/indic-hplt-v2

Adds Nepali, Odia, and Assamese on top of the v1 languages, with ~3.5× as many documents overall.
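
For anyone who wants to peek at the data before committing to a full download, here is a minimal streaming sketch. The repo IDs come from the links above; the `train` split name, the optional `config` argument, and the helper name `stream_documents` are assumptions for illustration, so check the dataset cards for the actual configs and splits (if a repo defines several configs, `load_dataset` will ask for one by name).

```python
from itertools import islice

# Repo IDs taken from the Hugging Face links in this post.
REPOS = {
    "v1": "AM0908/indic-hplt-v1",
    "v2": "AM0908/indic-hplt-v2",
}


def stream_documents(repo_id, limit=3, config=None, split="train"):
    """Yield the first `limit` records without downloading the whole corpus.

    `config` and `split` are assumptions; adjust them to the dataset card.
    """
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(repo_id, name=config, split=split, streaming=True)
    yield from islice(ds, limit)
```

Calling `for doc in stream_documents(REPOS["v2"]): print(doc)` should print a few raw records, which is a quick way to see the column names before writing a training loader.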

🔧 How it was built

  • Source: HPLT v3 sorted shards (top-scoring documents by WDS, i.e. Web Docs Scorer, quality score)
  • Quality filters: 50–100K chars/doc, max 50% non-alphabetic chars, min avg word length 2.0
  • Deduplication: SHA-256 exact dedup on all languages + MinHash LSH near-dedup on English (Jaccard ≥ 0.7)
  • Pipeline code: https://github.com/ashtok/multilingual-hplt-corpus
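
The filtering and dedup steps above can be roughly sketched with the standard library alone. This is not the code from the linked repo, just an illustration under some assumptions: "50–100K chars" is read as 50 to 100,000 characters, combining marks are counted as alphabetic (Indic vowel signs are Unicode marks, not letters), shingles are 5-word windows, and the MinHash uses 64 salted BLAKE2b hashes. The LSH banding step is left out; the sketch only shows how a signature pair estimates Jaccard similarity against the 0.7 threshold.

```python
import hashlib
import unicodedata

# Thresholds from the list above; the 50 to 100,000 reading is an assumption.
MIN_CHARS, MAX_CHARS = 50, 100_000
MAX_NON_ALPHA = 0.5
MIN_AVG_WORD_LEN = 2.0


def non_alpha_ratio(text):
    """Fraction of characters that are neither letters nor combining marks."""
    if not text:
        return 1.0
    # Categories L* (letters) and M* (marks) both count as alphabetic here.
    alpha = sum(unicodedata.category(ch)[0] in "LM" for ch in text)
    return 1.0 - alpha / len(text)


def passes_quality(text):
    """Apply the length, non-alphabetic ratio, and average word length checks."""
    words = text.split()
    if not words or not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    if non_alpha_ratio(text) > MAX_NON_ALPHA:
        return False
    return sum(len(w) for w in words) / len(words) >= MIN_AVG_WORD_LEN


def exact_dedup(texts):
    """Yield each text once, keyed by the SHA-256 of its UTF-8 bytes."""
    seen = set()
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


def shingles(text, k=5):
    """Set of k-word windows (assumed shingle scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_perm=64):
    """One minimum per salted hash function over the document's shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams)
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a near-dedup pass, a pair of English documents whose `estimated_jaccard` is at or above 0.7 would be treated as duplicates and one of them dropped. At corpus scale the pairwise comparison is replaced by LSH bucketing so that only likely matches are compared.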

submitted by /u/ashtok897