. In the paper, they propose a method that allows "painting with words". Basically, this is like Make-A-Scene, but it uses only adjusted cross-attention scores. You can see the results and the detailed method in the paper.
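
To give a rough feel for what "adjusted cross-attention scores" can mean, here is a minimal NumPy sketch. It assumes the adjustment is an additive bias on the pre-softmax scores for the (pixel, token) pairs the user painted. The function name `biased_cross_attention`, the `region_mask` argument, and the constant `weight` are made up for illustration; the exact weighting schedule in the paper may differ, so treat this as a toy version and not as the paper's method.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biased_cross_attention(q, k, v, region_mask, weight=1.0):
    """Toy cross-attention with an additive, user-painted bias.

    q:           (n_pixels, d)   image-side queries
    k:           (n_tokens, d)   text-side keys
    v:           (n_tokens, d_v) text-side values
    region_mask: (n_pixels, n_tokens), 1 where the user painted that token
    weight:      hypothetical constant strength of the bias (an assumption)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # standard scaled dot-product scores
    scores = scores + weight * region_mask   # push painted pixels toward their token
    probs = softmax(scores, axis=-1)         # each pixel's distribution over tokens
    return probs @ v, probs


rng = np.random.default_rng(0)
n_pixels, n_tokens, dim = 4, 3, 8
q = rng.normal(size=(n_pixels, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))

# Pretend the user painted token 1 over the first two pixels.
mask = np.zeros((n_pixels, n_tokens))
mask[:2, 1] = 1.0

_, plain = biased_cross_attention(q, k, v, mask, weight=0.0)
_, painted = biased_cross_attention(q, k, v, mask, weight=5.0)

print("attention to token 1 without bias:", plain[:, 1])
print("attention to token 1 with bias:   ", painted[:, 1])
```

With the bias switched on, the painted pixels put more attention on the painted token, while the unpainted pixels are left unchanged. That is the whole idea: the text prompt stays the same and only the attention map is nudged toward the user's drawing.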

 The method from their paper was not open-sourced. Yet paint-with-words can be implemented with Stable Diffusion, since both share a common cross-attention module. So, I implemented it with 

Implementing "paint with words", the feature of NVIDIA's image generation AI "eDiffi" that generates images from words and painting, with the image generation AI "Stable Diffusion" - GIGAZINE