I am primarily looking for results of running the MMLU evaluation on modern large language models. I have found some data at https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results and will be asking the maintainers if and when they can provide any additional data.
MMLU may be the most commonly run evaluation for recent LLMs, but papers rarely report more than a single final number, and I have not been able to find datasets of the evaluation results for any major recent LLM paper.
submitted by /u/corey1505