{"id":36323,"date":"2025-11-02T21:27:09","date_gmt":"2025-11-02T20:27:09","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/p-training-better-llms-with-30-less-data-entropy-based-data-distillation\/"},"modified":"2025-11-02T21:27:09","modified_gmt":"2025-11-02T20:27:09","slug":"p-training-better-llms-with-30-less-data-entropy-based-data-distillation","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/p-training-better-llms-with-30-less-data-entropy-based-data-distillation\/","title":{"rendered":"[P] Training Better LLMs With 30% Less Data \u2013 Entropy-Based Data Distillation"},"content":{"rendered":"<p><!-- SC_OFF --><\/p>\n<div class=\"md\">\n<h1>I&#8217;ve been experimenting with data-efficient LLM training as part of a project I&#8217;m calling Oren, focused on entropy-based dataset filtering.<\/h1>\n<p>The philosophy behind this emerged from knowledge distillation pipelines, where student models basically inherit the same limitations of intelligence as the teacher models have. Thus, the goal of Oren is to change LLM training completely \u2013 from the current frontier approach of rapidly upscaling in compute and GPU hours to a new strategy: optimizing training datasets for smaller, smarter models.<\/p>\n<p>The experimentation setup: two identical 100M-parameter language models.<\/p>\n<ul>\n<li><strong>Model A:<\/strong> trained on 700M raw tokens<\/li>\n<li><strong>Model B:<\/strong> trained on the top 70% of samples (500M tokens) selected via entropy-based filtering<\/li>\n<\/ul>\n<p><strong>Result:<\/strong> Model B matched Model A in performance, while using 30% less data, time, and compute. 
No architecture or hyperparameter changes.<\/p>\n<p>Open-source models:<\/p>\n<p>\ud83e\udd17 <a href=\"https:\/\/huggingface.co\/vitalune\/nanochat-d10-raw-700m\">Model A &#8211; Raw (700M tokens)<\/a><\/p>\n<p>\ud83e\udd17 <a href=\"https:\/\/huggingface.co\/vitalune\/nanochat-d10-filtered-500m\">Model B &#8211; Filtered (500M tokens)<\/a><\/p>\n<p>Full documentation:<\/p>\n<p>\ud83d\udc7e <a href=\"https:\/\/github.com\/vitalune\/Oren\">GitHub Repository<\/a><\/p>\n<p>I&#8217;d love feedback, especially on how to generalize this into a reusable pipeline that can be applied directly to LLMs before pre-training and\/or fine-tuning. I&#8217;m currently considering a multi-agent system in which each agent is an SLM trained on a subdomain (e.g., coding, math, science), each with its own scoring metrics. I&#8217;d be especially glad to hear from anyone who has tried entropy- or loss-based filtering, and from anyone who has scaled it.<\/p>\n<\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Jolly-Act9349\"> \/u\/Jolly-Act9349 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1omqfrz\/p_training_better_llms_with_30_less_data\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1omqfrz\/p_training_better_llms_with_30_less_data\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been experimenting with data-efficient LLM training as part of a 
project I&#8217;m calling Oren, focused on&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-36323","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36323","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=36323"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/36323\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=36323"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=36323"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=36323"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}