generated at
11/17/2024, 9:28:31 PM
MMLU
https://arxiv.org/abs/2009.03300
Measuring Massive Multitask Language Understanding
#LLM-benchmark
Claude 3.5 Sonnet scores 90.4%, surpassing GPT-4
Current top score:
GPT-4
Steering at the Frontier: Extending the Power of Prompting - Microsoft Research
#Medprompt
https://www.youtube.com/watch?v=hVade_8H8mE
SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors