“FineWeb”: 15T tokens of cleaned Common Crawl webtext since 2013 (extracted from WARC, not WET), beats Pile etc. Submitted by /u/gwern