/motoso/Muse: Text-To-Image Generation via Masked Generative Transformers

https://muse-model.github.io/
2301.00704 Muse: Text-To-Image Generation via Masked Generative Transformers
Submitted on 2 Jan 2023
Muse is a fast, state-of-the-art text-to-image generation and editing model.
Huiwen Chang*, Han Zhang*, Jarred Barber†, AJ Maschinot†, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein†, Yuanzhen Li†, Dilip Krishnan†
*Equal contribution. †Core contribution.

>We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. 
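The authors introduce Muse, a text-to-image model that reaches state-of-the-art image generation quality while being far more efficient than diffusion or autoregressive models.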
>Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. 
Muse is trained on a masked modeling task in discrete token space. Given a text embedding extracted from a pre-trained LLM, it learns to predict image tokens that were randomly masked out.
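To make the masked-token objective concrete, here is a minimal sketch of how one training example could be built. It is not the authors' code: `MASK_ID`, the vocabulary size of 8192, the 4x4 grid, and the fixed `mask_ratio` are all assumptions chosen for illustration, and the text conditioning is left out.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked position


def mask_tokens(tokens, mask_ratio, rng):
    """Return (masked_input, targets) for one training example.

    targets maps each masked position to the original token id that
    the model would be asked to predict there.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK_ID
    return masked, targets


rng = random.Random(0)
# Toy image: a flattened 4x4 grid of discrete token ids (assumed vocabulary of 8192).
image_tokens = [rng.randrange(8192) for _ in range(16)]
masked, targets = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
```

The model would see `masked` together with the text embedding, and the loss would be computed only at the positions listed in `targets`.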
>Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations;
Compared with pixel-space diffusion models such as Imagen and DALL-E 2, Muse is far more efficient because it works on discrete tokens and needs fewer sampling iterations.
> compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. 
Compared with autoregressive models such as Parti, Muse is more efficient because it decodes tokens in parallel.
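The efficiency argument can be sketched by counting forward passes. The toy loop below commits several tokens per pass, most confident first, so 256 tokens are filled in 8 passes, where a token-by-token autoregressive decoder would need 256. The even per-step schedule, `toy_predict`, and all sizes are assumptions for illustration; the abstract only states that decoding is parallel.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel for a masked position


def toy_predict(tokens, rng):
    """Stand-in for the Transformer: a (token, confidence) guess per masked slot."""
    return {i: (rng.randrange(8192), rng.random())
            for i, t in enumerate(tokens) if t == MASK_ID}


def parallel_decode(num_tokens, num_steps, rng):
    """Fill all tokens in num_steps passes, keeping the most confident guesses each pass."""
    tokens = [MASK_ID] * num_tokens
    passes = 0
    for step in range(num_steps):
        guesses = toy_predict(tokens, rng)  # one forward pass scores every masked slot
        passes += 1
        # Commit an even share of the remaining slots, most confident first.
        n_keep = math.ceil(len(guesses) / (num_steps - step))
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:n_keep]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens, passes


tokens, passes = parallel_decode(num_tokens=256, num_steps=8, rng=random.Random(0))
```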
>The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. 
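Using a pre-trained LLM gives fine-grained language understanding, which leads to high-fidelity image generation and to an understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality.
This point interests me.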
>Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. 
>The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. 
>Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing.
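The 900M-parameter model sets a new SOTA on CC3M with an FID of 6.06.
The 3B-parameter Muse model reaches an FID of 7.88 and a CLIP score of 0.32 on zero-shot COCO evaluation.
Muse directly enables many image editing applications, such as inpainting, outpainting, and mask-free editing, without fine-tuning or inverting the model.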
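One way to picture why inpainting needs no fine-tuning: with a masked-token model, the region to edit can simply be replaced with mask tokens and predicted again. The sketch below only shows that masking step on a toy 4x4 token grid. `mask_region`, `MASK_ID`, and the grid are hypothetical, and this is an assumption about how the abstract's claim could be realized, not the authors' procedure.

```python
MASK_ID = -1  # hypothetical sentinel for a masked position


def mask_region(token_grid, top, left, bottom, right):
    """Return a copy of token_grid with rows top..bottom-1 and columns left..right-1 masked."""
    return [[MASK_ID if top <= r < bottom and left <= c < right else tok
             for c, tok in enumerate(row)]
            for r, row in enumerate(token_grid)]


# Toy 4x4 grid of token ids; the central 2x2 block is the region to repaint.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
edited = mask_region(grid, top=1, left=1, bottom=3, right=3)
```

The masked grid would then be passed to the same model with a new text prompt, and only the masked positions would be filled in.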

What does "invert" mean?
Model Inversion
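>As the word "inversion" (reversal) suggests, this is a technique that reverses a model's input and output.
Information leaking from AI models? Examining the "Model Inversion Attack" that reconstructs training data | Blog | NRI Secure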