generated at 2/12/2025, 10:13:53 PM
Running BakLLaVA with PyLLMCore
References
https://github.com/advanced-stack/py-llm-core#llava---multi-modalities---mistral-vision (PyLLMCore/LLaVA - Multi modalities - Mistral Vision)
https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html (Multi modalities inference using Mistral AI LLaVA vision model - BakLLaVA)

I used the PyLLMCore library, but its LLaVA (VLM) support was only just added, so the API will probably change quite a bit.

Install PyLLMCore
$ python -m venv venv
$ venv/Scripts/activate
$ pip install py-llm-core

Download the quantized BakLLaVA model and the CLIP model
https://huggingface.co/advanced-stack/bakllava-mistral-v1-gguf/tree/main (BakLLaVA-1-Q4_K_M.gguf)
https://huggingface.co/advanced-stack/bakllava-mistral-v1-gguf/tree/main (BakLLaVA-1-clip-model.gguf)
Place the BakLLaVA model in the following directory
$ C:\Users\<username>\.cache\py-llm-core\models
The CLIP model can go anywhere you like
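
Before loading anything, it can help to confirm that both files are where the script expects them. The following is a minimal sketch, not part of PyLLMCore: `find_missing_models` is a hypothetical helper name, and the models directory is assumed to be `.cache/py-llm-core/models` under the user's home directory, as in the path above. The CLIP path is the one used in main.py and should be adjusted to wherever the file was actually saved.

```python
from pathlib import Path


def find_missing_models(models_dir: Path, clip_path: Path,
                        model_name: str = "BakLLaVA-1-Q4_K_M.gguf") -> list[Path]:
    """Return the expected model files that do not exist yet (hypothetical helper)."""
    expected = [models_dir / model_name, clip_path]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Assumption: PyLLMCore looks for models under <home>/.cache/py-llm-core/models
    models_dir = Path.home() / ".cache" / "py-llm-core" / "models"
    # Assumption: the CLIP model was saved to the same path used in main.py
    clip_path = Path(r"F:\BakLLaVA\model\BakLLaVA-1-clip-model.gguf")
    for path in find_missing_models(models_dir, clip_path):
        print(f"missing: {path}")
```

If nothing is printed, both files were found.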

Write the code
main.py
from llm_core.llm import LLaVACPPModel

model = "BakLLaVA-1-Q4_K_M.gguf"

llm = LLaVACPPModel(
    name=model,
    llama_cpp_kwargs={
        "logits_all": True,
        "n_ctx": 8000,
        "verbose": False,
        "n_gpu_layers": 100,  #: Set to 0 if you don't have a GPU
        "n_threads": 1,       #: Set to the number of available CPU cores
        "clip_model_path": r"F:\BakLLaVA\model\BakLLaVA-1-clip-model.gguf",
    }
)

llm.load_model()

history = [
    {
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': "http://localhost:8000/pexels-photo-4946625.jpg"}
        ]
    }
]

response = llm.ask('Please describe this image as accurately as possible', history=history)

print(response.choices[0].message.content)
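For some reason the image can only be specified by URL, so I start a local server with python -m http.server in the image directory and pass the image in as a localhost URL.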
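
The same workaround can also be done from inside a Python script instead of a separate terminal. This is a minimal sketch using only the standard library. `serve_directory` and `image_url` are hypothetical helper names, and passing port 0 (which lets the OS pick a free port) is my own choice, not something the original setup used.

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from urllib.parse import quote


def serve_directory(directory: Path, port: int = 0) -> ThreadingHTTPServer:
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server stops when the main script exits
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def image_url(server: ThreadingHTTPServer, filename: str) -> str:
    """Build the URL under which `filename` is reachable on the local server."""
    host, port = server.server_address[:2]
    return f"http://{host}:{port}/{quote(filename)}"
```

The URL returned by `image_url(server, "pexels-photo-4946625.jpg")` would then go into the `image_url` field of `history`. Call `server.shutdown()` when finished.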

Run
$ python main.py

Results
BakLLaVA
> The image features a kitchen with a counter that has various items on it, including several coffee cups and bottles. There is a sink located in the middle of the counter. A large, colorful world map is displayed on the wall behind the counter, adding a unique touch to the space. 
> In addition to the cups and bottles, there are also some bowls placed on the counter. The kitchen appears to be well-equipped for coffee making and serving, with multiple cups and bottles available for use.
CogVLM
> There is a brown wooden cabinet and a kitchen countertop in the picture. On the left side of the cabinet, there is a black sink and two silver water pipes. There are also two thermos and several bottles of soap on the countertop. On the right side of the cabinet, there is a coffee maker and a black electric stove. There are two cups and a black electronic thermometer on the countertop. Behind the cabinet is a blue world map, with English, Chinese, and Latin letters on it.
WD14-tagger
>english_text, no_humans, window, table, scenery, paper

BakLLaVA
>The image features a young woman with black hair, wearing a red bow and sporting a large, red bubble in her mouth. She is also wearing an Asian-style school uniform with a tie. Her expression appears to be a mix of innocence and rebellion, as she chews on the gum with attitude.
CogVLM
>A cartoon girl wearing a gray sailor suit is blowing bubble gum. She has a black hair tie and a red tie. Her eyes are red, and she is wearing a black collar. The background is white.
WD14-tagger
>1girl, solo, breasts, looking_at_viewer, bangs, simple_background, shirt, black_hair, red_eyes, jewelry, school_uniform, monochrome, upper_body, earrings, serafuku, choker, sailor_collar, hair_bun, neckerchief, eyelashes, piercing, single_hair_bun, ear_piercing, red_neckerchief, spot_color, bubble_blowing, chewing_gum

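I was thinking of having LLaVA write the captions for a LoRA training dataset, but I'm not sure how well that would work.
WD14-tagger really is remarkably strong on anime-style characters.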
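
If I do try the captioning idea, the loop might look roughly like this. It is a minimal sketch under two assumptions: that the training tool reads each caption from a .txt file with the same base name as the image (a common convention, but check your trainer), and that `captioner` is any function taking an image path and returning text. `write_captions` and `captioner` are hypothetical names. In practice `captioner` would wrap the `llm.ask` call from main.py.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_captions(dataset_dir: Path,
                   captioner: Callable[[Path], str],
                   overwrite: bool = False) -> int:
    """Write one caption .txt file next to each image; return how many were written."""
    written = 0
    for image in sorted(dataset_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files, including existing captions
        caption_file = image.with_suffix(".txt")
        if caption_file.exists() and not overwrite:
            continue  # keep captions that were already written or hand-edited
        caption_file.write_text(captioner(image).strip() + "\n", encoding="utf-8")
        written += 1
    return written
```

Skipping existing caption files by default means an interrupted run can be resumed without calling the model again for images that are already done.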