20M+ Indian Court Cases – Structured Metadata, Citation Graphs, Vector Embeddings (API + Bulk Export)

I spent 6 years indexing Indian court cases from the Supreme Court, all 25 High Courts, and 14 Tribunals. Sharing because I haven’t seen a structured Indian legal dataset at this scale anywhere.

What’s in it:

– 20M+ cases, each with the source PDF and structured metadata (court, bench, date, parties, sections cited, acts referenced, case type, headnotes)

– Citation graph across the full corpus (which case cites, follows, distinguishes, or overrules which)

– 23,122 Indian Acts and Statutes (Central, State, Regulatory) with full text and amendment tracking

– Vector embeddings (Voyage AI, 1024d) for every case

– Bilingual legal translation pairs across 11 Indian languages (Hindi, Tamil, Telugu, Bangla, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Urdu) paired with English
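
To make the citation graph item concrete, here is a minimal sketch of how typed citation edges could be stored and queried in memory. The four treatment labels come from the list above; the class, method names, and case IDs are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict

# Treatment labels named in the post; everything else here is an assumption.
TREATMENTS = {"cites", "follows", "distinguishes", "overrules"}


class CitationGraph:
    """Directed graph where an edge means 'citing case treats cited case'."""

    def __init__(self):
        self._out = defaultdict(list)  # citing case -> [(cited case, treatment)]
        self._in = defaultdict(list)   # cited case -> [(citing case, treatment)]

    def add_edge(self, citing, cited, treatment="cites"):
        if treatment not in TREATMENTS:
            raise ValueError(f"unknown treatment: {treatment}")
        self._out[citing].append((cited, treatment))
        self._in[cited].append((citing, treatment))

    def cited_by(self, case_id, treatment=None):
        """Cases that cite case_id, optionally filtered by treatment."""
        return [
            citing
            for citing, t in self._in.get(case_id, [])
            if treatment is None or t == treatment
        ]

    def is_overruled(self, case_id):
        """True if any later case in the graph overrules case_id."""
        return bool(self.cited_by(case_id, "overrules"))


# Hypothetical case IDs, for illustration only.
graph = CitationGraph()
graph.add_edge("case-B", "case-A", "follows")
graph.add_edge("case-C", "case-A", "overrules")
graph.add_edge("case-C", "case-B", "distinguishes")
```

Keeping both an outgoing and an incoming index makes "who relied on this case?" as cheap as "what does this case rely on?", which is usually the more interesting question for checking whether a precedent is still good law.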

For context: India has the world’s largest common law system, with 40M+ pending cases. Court judgments are public domain under Indian law (no copyright on judicial decisions). But the raw data is scattered across 25+ court websites, each with its own format, and many orders are scanned image PDFs with no searchable text.

Available as:

– REST API (sub-500ms hybrid semantic + keyword search)

– Bulk export (JSON / Parquet)

– Vector search via Qdrant
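
For readers unfamiliar with hybrid search: one common way to combine a keyword ranking with a semantic (vector) ranking is reciprocal rank fusion. The sketch below illustrates that general technique only. Whether this API fuses results this way is not stated in the post, and the function name, the constant k=60, and the case IDs are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list.

    Each ID scores 1 / (k + rank) per list it appears in (rank starts at 1).
    k=60 is a commonly used default, not a value taken from the post.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword index and a vector index.
keyword_hits = ["case-3", "case-1", "case-7"]
semantic_hits = ["case-1", "case-9", "case-3"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['case-1', 'case-3', 'case-9', 'case-7']
```

Rank-based fusion avoids having to normalise keyword scores and cosine similarities onto a common scale, which is why it is a frequent first choice for hybrid retrieval.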

The bilingual legal translation pairs might be interesting for NLP researchers working on low-resource Indian languages. Legal text is formal register with precise terminology, which is hard to find in most Indian language corpora.

Details: vaquill ai

Happy to answer questions about the data collection process, schema, or coverage gaps.

submitted by /u/zriyansh