{"id":39770,"date":"2026-03-19T06:27:15","date_gmt":"2026-03-19T05:27:15","guid":{"rendered":"https:\/\/www.graviton.at\/letterswaplibrary\/looking-for-datasets-where-multiple-llms-are-evaluated-on-the-same-prompts-for-routing-research-what-are-you-using\/"},"modified":"2026-03-19T06:27:15","modified_gmt":"2026-03-19T05:27:15","slug":"looking-for-datasets-where-multiple-llms-are-evaluated-on-the-same-prompts-for-routing-research-what-are-you-using","status":"publish","type":"post","link":"https:\/\/www.graviton.at\/letterswaplibrary\/looking-for-datasets-where-multiple-llms-are-evaluated-on-the-same-prompts-for-routing-research-what-are-you-using\/","title":{"rendered":"Looking For Datasets Where Multiple LLMs Are Evaluated On The Same Prompts (for Routing Research) \u2014 What Are You Using?"},"content":{"rendered":"<div class=\"md\">\n<p>Hey all,<\/p>\n<p>I&#8217;m building an LLM router (a system that routes each incoming prompt to the cheapest model likely to pass, rather than always sending everything to GPT-4). The core idea: if a prompt is simple enough for Mistral-7B, why pay for GPT-4?<\/p>\n<p>I\u2019m currently using the <a href=\"https:\/\/github.com\/withmartian\/routerbench\">RouterBench<\/a> dataset heavily. Data like this is incredibly valuable because you get multiple model outputs for the exact same prompts, plus metadata like cost and quality, which makes it much easier to experiment with routing strategies and selection policies.<\/p>\n<p>I\u2019m wondering: are there other public datasets or benchmarks that provide:<\/p>\n<ul>\n<li>The same prompt \/ input evaluated by several different LLMs<\/li>\n<li>Full model outputs (not just scores)<\/li>\n<li>Ideally with some form of human or automated quality labels<\/li>\n<\/ul>\n<p>They don\u2019t have to be as big or polished as RouterBench, but anything in this spirit (evaluation logs, comparison datasets, crowdsourced model outputs, etc.) would be super helpful. 
Links to GitHub, Hugging Face datasets, papers with released generations, or hosted eval platforms that export data are all welcome.<\/p>\n<p>If you\u2019ve built your own multi-model eval logs and are open to sharing them, even partially anonymized, I\u2019d also love to hear about that.<\/p>\n<p>Thanks!<\/p>\n<\/div>\n<p>submitted by <a href=\"https:\/\/www.reddit.com\/user\/Apart-Dot-973\"> \/u\/Apart-Dot-973 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1rxri2k\/looking_for_datasets_where_multiple_llms_are\/\">[link]<\/a><\/span> <span><a href=\"https:\/\/www.reddit.com\/r\/datasets\/comments\/1rxri2k\/looking_for_datasets_where_multiple_llms_are\/\">[comments]<\/a><\/span><\/p>","protected":false},"excerpt":{"rendered":"<p>Hey all, I&#8217;m building an LLM router (a system that routes each incoming prompt to the 
cheapest&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-39770","post","type-post","status-publish","format-standard","hentry","category-datatards","wpcat-85-id"],"_links":{"self":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/39770","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/comments?post=39770"}],"version-history":[{"count":0,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/posts\/39770\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/media?parent=39770"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/categories?post=39770"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.graviton.at\/letterswaplibrary\/wp-json\/wp\/v2\/tags?post=39770"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}