
generated at 2/12/2025, 8:44:02 AM
Chatbot Arena
https://lmarena.ai/?leaderboard 🏆 LMSYS Chatbot Arena Leaderboard

https://huggingface.co/papers/2403.04132 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
https://lmsys.org/blog/2023-05-03-arena/ Benchmarking LLMs in the Wild with Elo Ratings
An LLM benchmark based on Elo ratings

Dataset
https://huggingface.co/datasets/lmsys/chatbot_arena_conversations lmsys/chatbot_arena_conversations
>This dataset contains 33K cleaned conversations with pairwise human preferences, collected on Chatbot Arena from April to June 2023.

https://twitter.com/lmsysorg/status/1661818390783352833 2023/5/26
https://twitter.com/lmsysorg/status/1656387380704854016 Week 2 (2023/5/10)
Interesting that RWKV ranks quite high in this lineup
Week 1

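How it works
Data collection
Uses FastChat
Users chat with two anonymous models side by side
They vote for the model they think is better
Once the vote is submitted, the model names are revealed
→ Continue chatting, or restart with a different pair of models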
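The voting flow above can be sketched roughly as follows. This is only an illustration, not FastChat's actual API: the function names `start_battle`, `visible_names`, and `submit_vote`, the dict layout, and the "Model A" / "Model B" labels are all hypothetical.

```python
import random

def start_battle(models, rng=random):
    """Pair two distinct models; their names stay hidden until a vote arrives."""
    model_a, model_b = rng.sample(models, 2)
    return {"model_a": model_a, "model_b": model_b, "winner": None}

def visible_names(battle):
    """Names shown to the user: anonymous before the vote, real afterwards."""
    if battle["winner"] is None:
        return ("Model A", "Model B")
    return (battle["model_a"], battle["model_b"])

def submit_vote(battle, winner):
    """Record the vote and return a (model_a, model_b, winner) battle record."""
    if winner not in ("model_a", "model_b"):
        raise ValueError("winner must be 'model_a' or 'model_b'")
    battle["winner"] = winner
    return (battle["model_a"], battle["model_b"], winner)

# Hypothetical model names, used only for illustration.
battle = start_battle(["model-x", "model-y", "model-z"])
print(visible_names(battle))   # anonymous labels before the vote
record = submit_vote(battle, "model_a")
print(visible_names(battle))   # real names after the vote
```

Keeping the names hidden until the vote is recorded is the point of the design: the vote cannot be influenced by knowing which model produced which answer.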
Results collected (week 1)
Number of battles per model pair
Languages used by users
Mostly English
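Both statistics are simple tallies over the battle log. A minimal sketch, assuming each battle is stored as a hypothetical `(model_a, model_b, language)` tuple (the real log format may differ):

```python
from collections import Counter

def battle_counts(battles):
    """Count battles per unordered model pair, so (x, y) and (y, x) are merged."""
    return Counter(tuple(sorted((a, b))) for a, b, _lang in battles)

def language_share(battles):
    """Fraction of battles per user language."""
    counts = Counter(lang for _a, _b, lang in battles)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Hypothetical records, used only for illustration.
battles = [
    ("model-x", "model-y", "English"),
    ("model-y", "model-x", "English"),
    ("model-x", "model-z", "English"),
    ("model-z", "model-y", "Japanese"),
]
print(battle_counts(battles))
print(language_share(battles))
```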
Elo rating
Comparison of win rates computed directly from pairwise battles (left) and pairwise win rates predicted from Elo ratings (right)
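To make the Elo side of this comparison concrete, here is a minimal sketch of an online Elo update over a battle log, together with the win rate that the resulting ratings predict. The constants (initial rating 1000, K = 4, scale 400) and the `(model_a, model_b, winner)` record format are assumptions for illustration, as is the handling of ties as half a win.

```python
from collections import defaultdict

def expected_score(r_a, r_b, scale=400.0):
    """Elo-predicted probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def compute_elo(battles, k=4.0, initial=1000.0):
    """Sequentially update ratings from (model_a, model_b, winner) records.

    winner is "model_a", "model_b", or "tie" (a tie counts as half a win).
    """
    ratings = defaultdict(lambda: initial)
    score_for_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = score_for_a[winner]
        # Zero-sum update: what A gains, B loses.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] -= k * (s_a - e_a)
    return dict(ratings)

# Hypothetical battle log, used only for illustration.
battles = [("model-x", "model-y", "model_a")] * 10 + [("model-x", "model-y", "tie")]
ratings = compute_elo(battles)
print(ratings)
print(expected_score(ratings["model-x"], ratings["model-y"]))
```

Because the update is sequential, the final ratings depend on the order of the battles. The predicted win rate from `expected_score` can be compared against the raw fraction of wins for each pair, which is what the left and right panels contrast.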
2024/8/29 Style Control
Examines how much style, meaning writing style and appearance (for example, length and Markdown usage), affects the rankings
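One way to picture such an analysis is a Bradley-Terry style logistic regression with an extra style feature, so that the effect of style is estimated separately from model strength. The sketch below is not LMSYS's implementation. It uses response length as the only style feature, a normalized length difference as an assumed encoding, and a hypothetical function name and record format `(model_a, model_b, len_a, len_b, a_wins)`.

```python
import math

def fit_style_controlled(battles, lr=0.5, epochs=2000, l2=1e-3):
    """Fit P(A wins) = sigmoid(s[A] - s[B] + gamma * style) by gradient ascent.

    style = (len_a - len_b) / (len_a + len_b); gamma measures how much
    length alone sways votes, and s holds the style-adjusted strengths.
    """
    models = sorted({m for b in battles for m in b[:2]})
    strength = {m: 0.0 for m in models}
    gamma = 0.0
    n = len(battles)
    for _ in range(epochs):
        grad_s = {m: 0.0 for m in models}
        grad_g = 0.0
        for a, b, len_a, len_b, a_wins in battles:
            style = (len_a - len_b) / (len_a + len_b)
            logit = strength[a] - strength[b] + gamma * style
            p = 1.0 / (1.0 + math.exp(-logit))
            err = a_wins - p  # gradient of the log-likelihood w.r.t. the logit
            grad_s[a] += err
            grad_s[b] -= err
            grad_g += err * style
        # Small L2 penalty keeps the parameters finite and identifiable.
        for m in models:
            strength[m] += lr * (grad_s[m] / n - l2 * strength[m])
        gamma += lr * (grad_g / n - l2 * gamma)
    return strength, gamma

# Hypothetical data: the longer answer always wins, whichever model wrote it.
battles = [
    ("model-x", "model-y", 200, 100, 1),
    ("model-x", "model-y", 100, 200, 0),
    ("model-y", "model-x", 200, 100, 1),
    ("model-y", "model-x", 100, 200, 0),
]
strength, gamma = fit_style_controlled(battles)
print(strength, gamma)
```

In this toy data the wins are explained entirely by length, so the fit assigns a positive `gamma` and nearly equal strengths to both models. A ranking that ignored style would have no way to separate the two effects.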


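Human evaluation is simply easier to understand and more convincing than benchmarks like MMLU
It would be good to see more votes in Japanese (and in other non-English languages generally), to learn how the evaluation differs by language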

LMSYS ORG