{"id":31890,"date":"2024-12-14T09:28:24","date_gmt":"2024-12-14T08:28:24","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/institutional-data-initiative-plans-to-release-a-dataset-5-times-that-of-book3-in-early-2025\/"},"modified":"2024-12-14T09:28:24","modified_gmt":"2024-12-14T08:28:24","slug":"institutional-data-initiative-plans-to-release-a-dataset-5-times-that-of-book3-in-early-2025","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/institutional-data-initiative-plans-to-release-a-dataset-5-times-that-of-book3-in-early-2025\/","title":{"rendered":"Institutional Data Initiative Plans To Release A Dataset &#8220;5 Times That Of Books3&#8221; In Early 2025"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p><a href=\"https:\/\/institutionaldatainitiative.org\/\">https:\/\/institutionaldatainitiative.org\/<\/a><\/p>\n<p><a href=\"https:\/\/www.wired.com\/story\/harvard-ai-training-dataset-openai-microsoft\/\">https:\/\/www.wired.com\/story\/harvard-ai-training-dataset-openai-microsoft\/<\/a><\/p>\n<p>Harvard University announced Thursday it\u2019s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard\u2019s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. 
It contains books scanned as part of the Google Books project that are no longer protected by copyright&#8230; with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries&#8230; In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it\u2019s open to forming similar collaborations down the line.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/furrypony2718\"> \/u\/furrypony2718 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1hdy93f\/institutional_data_initiative_plans_to_release_a\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1hdy93f\/institutional_data_initiative_plans_to_release_a\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>https:\/\/institutionaldatainitiative.org\/ https:\/\/www.wired.com\/story\/harvard-ai-training-dataset-openai-microsoft\/ Harvard University announced Thursday it\u2019s releasing a high-quality dataset of nearly one million public-domain 
books&#8230;<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-31890","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/31890","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=31890"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/31890\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=31890"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=31890"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=31890"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}