AI Books4 Dataset For Training LLMs Further

What?

More than 400,000 full-text fiction and non-fiction books. Multiple languages, curated, deduplicated.

More than 6,000,000 full-text scholarly publications, magazines, and manuals. Multiple languages, curated, deduplicated.

150,000,000 metadata records

Format

A Zstd-compressed JSON Lines file, one line per book/publication.

abstract, content – description and full text, both in Markdown format

issued_at – the time the object itself was issued (not the record)

metadata – ISBNs, publishers, series, etc.

id – identifier in external systems, if applicable (e.g. DOI)

The other fields should be self-explanatory.
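
A minimal sketch of how a file in this format could be streamed without decompressing it to disk first. It assumes the third-party `zstandard` package (`pip install zstandard`) and UTF-8 text; the helper names `iter_records`, `open_zst_text`, and `summarize` are made up for illustration, and the exact types of `id`, `issued_at`, and `metadata` are not stated in the post, so the code only checks for their presence.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path: str) -> io.TextIOWrapper:
    """Open a .zst file as a streaming UTF-8 text reader.

    Assumption: the `zstandard` package is installed. It is imported
    lazily so the rest of the module works without it.
    """
    import zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


def summarize(record: dict) -> dict:
    """Pick out the documented fields; missing fields become None/0/False."""
    return {
        "id": record.get("id"),
        "issued_at": record.get("issued_at"),
        "has_abstract": bool(record.get("abstract")),
        "content_chars": len(record.get("content") or ""),
    }


# Hypothetical usage, once the download has finished:
# with open_zst_text("libstc2.jsonl.zst") as fh:
#     for record in iter_records(fh):
#         print(summarize(record))
#         break
```

Reading line by line keeps memory use flat, which matters for a file holding millions of full texts.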

Download:

magnet:?xt=urn:btih:a904e660355c49006b2e7d43893d31bf3c2be9cc&dn=libstc2.jsonl.zst&tr=udp://tracker.opentrackr.org:1337/announce&tr=https://tracker1.ctix.cn:443/announce&tr=udp://open.demonii.com:1337/announce

submitted by /u/JohnTheMelancholic