{"id":25024,"date":"2024-01-01T04:28:29","date_gmt":"2024-01-01T03:28:29","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/part-synthetic-generative-ai-for-math-part-i-mathpile-a-billion-token-scale-pretraining-corpus-for-math\/"},"modified":"2024-01-01T04:28:29","modified_gmt":"2024-01-01T03:28:29","slug":"part-synthetic-generative-ai-for-math-part-i-mathpile-a-billion-token-scale-pretraining-corpus-for-math","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/part-synthetic-generative-ai-for-math-part-i-mathpile-a-billion-token-scale-pretraining-corpus-for-math\/","title":{"rendered":"[Part-Synthetic] &#8220;Generative AI For Math: Part I &#8212; MathPile: A Billion-Token-Scale Pretraining Corpus For Math&#8221;"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p><strong>Paper<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2312.17120\">https:\/\/arxiv.org\/abs\/2312.17120<\/a><\/p>\n<p><strong>Datasets<\/strong>: <a href=\"https:\/\/huggingface.co\/datasets\/GAIR\/MathPile\">https:\/\/huggingface.co\/datasets\/GAIR\/MathPile<\/a><\/p>\n<p><strong>Code<\/strong>: <a href=\"https:\/\/github.com\/GAIR-NLP\/MathPile\/\">https:\/\/github.com\/GAIR-NLP\/MathPile\/<\/a><\/p>\n<p><strong>Project page<\/strong>: <a href=\"https:\/\/gair-nlp.github.io\/MathPile\/\">https:\/\/gair-nlp.github.io\/MathPile\/<\/a><\/p>\n<p><strong>Abstract<\/strong>:<\/p>\n<p>High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of &#8220;less is more&#8221;, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. 
Our meticulous data collection and processing pipeline included preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope MathPile can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile, along with the scripts used for processing, to facilitate future developments in this field.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/APaperADay\"> \/u\/APaperADay <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/18vn7ki\/partsynthetic_generative_ai_for_math_part_i\/\">[link]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Paper: https:\/\/arxiv.org\/abs\/2312.17120 Datasets: https:\/\/huggingface.co\/datasets\/GAIR\/MathPile Code: https:\/\/github.com\/GAIR-NLP\/MathPile\/ Project page: https:\/\/gair-nlp.github.io\/MathPile\/ Abstract: High-quality, large-scale corpora are the cornerstone 
of&#8230;<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-25024","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/25024","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=25024"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/25024\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=25024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=25024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=25024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}