{"id":41541,"date":"2026-06-25T14:27:25","date_gmt":"2026-06-25T12:27:25","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/i-processed-the-entire-arxiv-latex-source-corpus-3m-papers-into-a-metadata-aligned-parquet-dataset-to-save-on-s3-egress-fees\/"},"modified":"2026-06-25T14:27:25","modified_gmt":"2026-06-25T12:27:25","slug":"i-processed-the-entire-arxiv-latex-source-corpus-3m-papers-into-a-metadata-aligned-parquet-dataset-to-save-on-s3-egress-fees","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/i-processed-the-entire-arxiv-latex-source-corpus-3m-papers-into-a-metadata-aligned-parquet-dataset-to-save-on-s3-egress-fees\/","title":{"rendered":"I Processed The Entire ArXiv LaTeX Source Corpus (3M+ Papers) Into A Metadata-aligned Parquet Dataset To Save On S3 Egress Fees"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I\u2019ve spent the last few weeks working on a pipeline to solve a problem that has frustrated me (and likely other researchers) for a while: working with arXiv source files at scale.<\/p>\n<p>If you have ever tried to analyze the LaTeX source code of arXiv papers, you have probably run into two major roadblocks:<\/p>\n<ol>\n<li><strong>The Egress Tax:<\/strong> arXiv\u2019s official bulk S3 bucket is configured as &#8220;requester-pays.&#8221; If you try to download the complete 5 TB corpus to any machine outside of the AWS <code>us-east-1<\/code> region, you get hit with standard egress fees. At $0.09 per GB, a single full download can cost over $450 in bandwidth alone.<\/li>\n<li><strong>Unpacking Pain:<\/strong> The raw S3 data is packaged as hundreds of nested <code>.tar<\/code> archives containing gzipped payloads of individual papers. 
Extracting these, parsing the inner LaTeX code, and matching the files with their JSON metadata snapshots is CPU-intensive and requires a lot of boilerplate ingestion code.<\/li>\n<\/ol>\n<p>To make this easier, I built a pipeline that runs inside AWS <code>us-east-1<\/code> (where in-region data transfer is free), pulls the raw source files, unpacks them, matches them with the official metadata, and bundles them into ready-to-query Parquet partitions.<\/p>\n<ul>\n<li><strong>HuggingFace Dataset Link:<\/strong> <a href=\"https:\/\/huggingface.co\/datasets\/scholarweave\/arxiv\">https:\/\/huggingface.co\/datasets\/scholarweave\/arxiv<\/a><\/li>\n<\/ul>\n<h1>What is inside:<\/h1>\n<p>Each row represents a single paper and contains both the official metadata and the parsed source files:<\/p>\n<ul>\n<li><strong>Core Metadata:<\/strong> <code>id<\/code>, <code>title<\/code>, <code>authors<\/code>, <code>abstract<\/code>, <code>doi<\/code>, <code>categories<\/code>, <code>license<\/code>, <code>versions<\/code>, etc.<\/li>\n<li><code>latex<\/code> <strong>(Large String):<\/strong> The paper\u2019s parsed LaTeX source, consolidated into one string. I wrote a parser that bundles the primary <code>.tex<\/code>, <code>.bib<\/code>, and <code>.sty<\/code> files into a single, readable Markdown-style tree structure.<\/li>\n<\/ul>\n<h1>Maintenance &amp; Syncing:<\/h1>\n<ul>\n<li><strong>Monthly Updates:<\/strong> I plan to run the pipeline once a month to pick up new uploads.<\/li>\n<li><strong>Resilient Syncing:<\/strong> I maintain an XML manifest file in the HuggingFace repository (<code>arxiv_parquet_manifest.xml<\/code>) that maps each Parquet partition to its size, MD5 checksum, and the raw S3 <code>.tar<\/code> source files used to generate it. 
This should make incremental syncing or troubleshooting much easier.<\/li>\n<\/ul>\n<p>If you are working on NLP, training LLMs on scientific text, analyzing citation networks, or doing sociolinguistic research, hopefully this saves you some time and cloud budget.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Invicto_50\"> \/u\/Invicto_50 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1uf7i8w\/i_processed_the_entire_arxiv_latex_source_corpus\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1uf7i8w\/i_processed_the_entire_arxiv_latex_source_corpus\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>I\u2019ve spent the last few weeks working on a pipeline to solve a problem that has 
frustrated&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-41541","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/41541","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=41541"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/41541\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=41541"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=41541"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=41541"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}