/motoso/OpenAI o1 - Scrapbox Reader

generated at 2/12/2025, 12:43:34 AM
OpenAI o1
The preview version was the first to be released

Introducing OpenAI o1 | OpenAI
Aimed at users who need mathematical and scientific reasoning
>These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields.
> For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows. 
LLM services impose ethics on their users, and that tendency has been strengthened here
>One way we measure safety is by testing how well our model continues to follow its safety rules if a user tries to bypass them (known as "jailbreaking"). On one of our hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100) while our o1-preview model scored 84. You can read more about this in the system card and our research post.
For coding, which needs little world knowledge and mostly just reasoning, OpenAI o1-mini is useful
OpenAI Platform
>o1 models excel in scientific reasoning, ranking in the 89th percentile on competitive programming questions (Codeforces), 
>placing among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), 
>exceeding human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).


Learning to Reason with LLMs | OpenAI
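Responses take time
Not meant for chatbot use the way GPT-4 and similar models are
Beats GPT-4o on accuracy by a wide margin across benchmarks
AIME is an exam taken by top high school students in the US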
Older benchmarks such as MATH and GSM8K are now too easy to be useful for evaluating LLMs
GPQA Diamond is a test covering biology, physics, and chemistry. The expert humans it is compared against hold PhDs
Codeforces is competitive programming
o1-preview appears to differ from the internal o1
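They built o1-ioi, a model further specialized from o1 for the International Olympiad in Informatics, and tested it under the same conditions as the contestants.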
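Contestants submit about 50 times on average, so with that as the limit it scored about 60 points above the average score
When allowed up to 10,000 submissions, it reached gold-medal level
Astonishing
After training the model, they tested it in actual contests (10 submissions allowed); the figure above shows the result. It outperformed 93% of contestants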
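The 78.2% on MMMU is on par with human experts
Subjects with a strong logical component (math, physics, formal logic) improved, but language subjects did not improve at all, so there is no need to use this model in areas where logic plays little role
The article discusses this as well
This reconfirms that what is left for humans in the LLM era is the unethical and the irrational
It refines its strategy through Chain of Thought. When it judges a strategy to be unhelpful, it switches to a different one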
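
The IOI notes above say the result improved as the submission limit was relaxed from 50 to 10,000. A minimal sketch of why extra attempts can help, under a deliberately simplified assumption that does not come from the article: each submission is treated as an independent try that fully solves the problem with a fixed probability `p`. The function name and the value of `p` are hypothetical and chosen only for illustration; real IOI scoring awards partial points and the submissions are not independent.

```python
def chance_of_any_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumes every attempt succeeds with the same probability p
    (an illustrative simplification, not how the model was evaluated).
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.001  # hypothetical per-submission success rate
    for n in (10, 50, 10_000):
        print(f"{n:>6} submissions -> {chance_of_any_success(p, n):.4f}")
```

Even with a very small per-attempt success rate, the chance of at least one success grows quickly with the number of attempts, which is consistent with the direction of the reported results but says nothing about their actual magnitude.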

> @sama: here is o1, a series of our most capable and aligned models yet:
>  
>  https://openai.com/index/learning-to-reason-with-llms/
>  
>  o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.
>  




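User reviews
> @Yh_Taguchi: I tried GPT-o1 on a problem that GPT-4o could not handle at all. The nonsensical calculations are gone, but since it is called an "equation of state," U must not remain in it (it should be written using only P, V, T, and N). It failed to notice that U could be eliminated using equations (10) and (12). If GPT-o1 were my student, I would not give it the credit. Still a long way to go.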
>    

> @tarutaru247: o1 can now do the "write something that ends with ○○" task that GPT-4o could not!
>   

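> @momonga27300038: GPT's new model OpenAI o1-preview is out. I had it try the probability problem (Question 3) from the 2024 Tohoku University first-term science-track math exam, which gpt-4o could not solve. After a 29-second wait it started writing its answer and got every part right. You can also check its thought process while you wait. The rate limit is 30 per week, orz. Use it wisely.  *Reposted with a retaken image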
>   

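> @ImAI_Eruel: By the way, partly because I am a researcher, I am one of the rare users who pays for all of OpenAI's ChatGPT, Google Gemini pro, and Anthropic Claude (surprisingly, researchers around me have cancelled some of them), so I had them, OpenAI o1 included, solve University of Tokyo math exam problems
>  As a result, OpenAI o1 did indeed solve them with quite high accuracy.…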
>     

It decides that 9.11 is larger than 9.9 (the reasoning is... sloppy!)
> @tomo3141592653: This is the latest AI announced today!
>  
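
For reference, the 9.11 versus 9.9 comparison can be checked directly. The sketch below compares the two strings as decimal numbers, and then as dotted version numbers; the version-number reading is only an assumption about one way the wrong answer could arise, not something the tweet or OpenAI states. Both function names are hypothetical.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return whichever string is larger when read as dotted version numbers."""
    def key(s: str) -> tuple:
        # "9.11" -> (9, 11): each dot-separated part is compared as an integer
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


if __name__ == "__main__":
    print(larger_as_decimal("9.11", "9.9"))  # 9.9
    print(larger_as_version("9.11", "9.9"))  # 9.11
```

As numbers, 9.9 is larger because 9.90 > 9.11; only the version-style reading, where 11 > 9 in the second component, makes 9.11 come out ahead.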

https://x.com/naotous/status/1865900062121009392?s=61&t=xKgElnjMcTKwRpLEEgVNdg