{"id":34278,"date":"2025-06-10T07:27:43","date_gmt":"2025-06-10T05:27:43","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/self-promotion-i-processed-and-standardized-16-7tb-of-sec-filings\/"},"modified":"2025-06-10T07:27:43","modified_gmt":"2025-06-10T05:27:43","slug":"self-promotion-i-processed-and-standardized-16-7tb-of-sec-filings","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/self-promotion-i-processed-and-standardized-16-7tb-of-sec-filings\/","title":{"rendered":"[self-promotion] I Processed And Standardized 16.7TB Of SEC Filings"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>SEC data is submitted in a format called Standard Generalized Markup Language (SGML). An SGML submission may contain many different files. For example, this <a href=\"https:\/\/www.sec.gov\/Archives\/edgar\/data\/1467623\/000156218025003559\/0001562180-25-003559.txt\">Form 4<\/a> contains xml and txt files. This isn&#8217;t really important unless you want to work with a lot of data, e.g. the entire SEC corpus.<\/p>\n<p>If you do want to work with a lot of SEC data, you can either buy the parsed SGML data or get it from the SEC&#8217;s website.<\/p>\n<p>Scraping the data is slow. The SEC rate limits you to 5 requests per second for extended durations. There are about 16,000,000 submissions, so this takes a while: roughly 37 days of continuous requests at one request per submission. A much faster approach is to download the bulk data files <a href=\"https:\/\/www.sec.gov\/Archives\/edgar\/Feed\/\">here<\/a>. However, these files are in SGML form.<\/p>\n<p>I&#8217;ve written a fast <a href=\"https:\/\/github.com\/john-friedman\/secsgml\">SGML parser<\/a> under the MIT License. The parser has been tested on the entire corpus, with &gt; 99.99% correctness. This is about as good as it gets, as the remaining errors are mostly due to issues on the SEC&#8217;s side. 
For example, some files have errors, especially in the pre-2001 years.<\/p>\n<p>Some stats about the corpus:<\/p>\n<table>\n<thead>\n<tr>\n<th align=\"left\">File Type<\/th>\n<th align=\"left\">Total Size (Bytes)<\/th>\n<th align=\"left\">File Count<\/th>\n<th align=\"left\">Average Size (Bytes)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"left\">htm<\/td>\n<td align=\"left\">7,556,829,704,482<\/td>\n<td align=\"left\">39,626,124<\/td>\n<td align=\"left\">190,703.23<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">xml<\/td>\n<td align=\"left\">5,487,580,734,754<\/td>\n<td align=\"left\">12,126,942<\/td>\n<td align=\"left\">452,511.5<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">jpg<\/td>\n<td align=\"left\">1,760,575,964,313<\/td>\n<td align=\"left\">17,496,975<\/td>\n<td align=\"left\">100,621.73<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">pdf<\/td>\n<td align=\"left\">731,400,163,395<\/td>\n<td align=\"left\">279,577<\/td>\n<td align=\"left\">2,616,095.61<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">xls<\/td>\n<td align=\"left\">254,063,664,863<\/td>\n<td align=\"left\">152,410<\/td>\n<td align=\"left\">1,666,975.03<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">txt<\/td>\n<td align=\"left\">248,068,859,593<\/td>\n<td align=\"left\">4,049,227<\/td>\n<td align=\"left\">61,263.26<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">zip<\/td>\n<td align=\"left\">205,181,878,026<\/td>\n<td align=\"left\">863,723<\/td>\n<td align=\"left\">237,555.19<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">gif<\/td>\n<td align=\"left\">142,562,657,617<\/td>\n<td align=\"left\">2,620,069<\/td>\n<td align=\"left\">54,411.8<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">json<\/td>\n<td align=\"left\">129,268,309,455<\/td>\n<td align=\"left\">550,551<\/td>\n<td align=\"left\">234,798.06<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">xlsx<\/td>\n<td align=\"left\">41,434,461,258<\/td>\n<td align=\"left\">721,292<\/td>\n<td align=\"left\">57,444.78<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">xsd<\/td>\n<td 
align=\"left\">35,743,957,057<\/td>\n<td align=\"left\">832,307<\/td>\n<td align=\"left\">42,945.64<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">fil<\/td>\n<td align=\"left\">2,740,603,155<\/td>\n<td align=\"left\">109,453<\/td>\n<td align=\"left\">25,039.09<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">png<\/td>\n<td align=\"left\">2,528,666,373<\/td>\n<td align=\"left\">119,723<\/td>\n<td align=\"left\">21,120.97<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">css<\/td>\n<td align=\"left\">2,290,066,926<\/td>\n<td align=\"left\">855,781<\/td>\n<td align=\"left\">2,676.0<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">js<\/td>\n<td align=\"left\">1,277,196,859<\/td>\n<td align=\"left\">855,781<\/td>\n<td align=\"left\">1,492.43<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">html<\/td>\n<td align=\"left\">36,972,177<\/td>\n<td align=\"left\">584<\/td>\n<td align=\"left\">63,308.52<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">xfd<\/td>\n<td align=\"left\">9,600,700<\/td>\n<td align=\"left\">2,878<\/td>\n<td align=\"left\">3,335.89<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">paper<\/td>\n<td align=\"left\">2,195,962<\/td>\n<td align=\"left\">14,738<\/td>\n<td align=\"left\">149.0<\/td>\n<\/tr>\n<tr>\n<td align=\"left\">frm<\/td>\n<td align=\"left\">1,316,451<\/td>\n<td align=\"left\">417<\/td>\n<td align=\"left\">3,156.96<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><a href=\"https:\/\/github.com\/john-friedman\/secsgml\">The SGML parsing package<\/a>, <a href=\"https:\/\/github.com\/john-friedman\/SEC-Census\">Stats on processing the corpus,<\/a> <a href=\"https:\/\/github.com\/john-friedman\/datamule-python\">convenience package for SEC data<\/a>. 
<\/p>\n<\/div>\n<p><!-- SC_ON --> submitted by <a href=\"https:\/\/www.reddit.com\/user\/status-code-200\">\/u\/status-code-200<\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1l7q7v1\/selfpromotion_i_processed_and_standardized_167tb\/\">[link]<\/a><\/span> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1l7q7v1\/selfpromotion_i_processed_and_standardized_167tb\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>SEC data is submitted in a format called Standard Generalized Markup Language. 
An SGML submission may contain&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-34278","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/34278","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=34278"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/34278\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=34278"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=34278"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=34278"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}