/motoso/LAION - Scrapbox Reader

LAION
https://laion.ai/
Large-scale Artificial Intelligence Open Network
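A collection of URLs and captions for about 5 billion images
It is easy to find entries whose captions are sloppy
There are far too many for humans to check one by one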
https://note.com/shi3zblog/n/nc9a0d759abf7
>LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images.
>While we downloaded and calculated CLIP embeddings of the pictures to compute similarity scores between pictures and texts, we subsequently discarded all the photos.
>Any researcher using the datasets must reconstruct the images data by downloading the subset they are interested in. For this purpose, we suggest the img2dataset tool.
https://laion.ai/faq/
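The FAQ mentions similarity scores between CLIP embeddings of pictures and texts. A minimal sketch of such a score, assuming cosine similarity and plain Python lists as stand-ins for embeddings. The `keep_pair` helper and its `threshold` parameter are hypothetical names for illustration, not LAION's actual pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def keep_pair(image_vec, text_vec, threshold):
    """Hypothetical filter: keep a pair only if its score reaches the threshold."""
    return cosine_similarity(image_vec, text_vec) >= threshold
```

In a real setting the two vectors would come from a CLIP image encoder and text encoder. Nothing in this sketch loads a model.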
Built by volunteers

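It apparently includes Danbooru URLs
I have not checked this myself, but it would not be surprising at all if it did
This can be checked at https://huggingface.co/datasets/laion/laion2B-en
There is roughly 120 × 2.7 GB (about 324 GB) of text data, so checking it would be a chore. I don't want to do it
The metadata alone is bigger than MSFS2020!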
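If someone did want to check, the test itself is simple. A minimal sketch, assuming the URLs have already been read out of the metadata files into an iterable of strings (loading the files is not shown) and assuming Danbooru images are served from `donmai.us` hosts. `is_danbooru_url` and `count_danbooru` are hypothetical helper names.

```python
from urllib.parse import urlparse

# Assumption: Danbooru and its image servers live under this domain.
DANBOORU_DOMAIN = "donmai.us"


def is_danbooru_url(url):
    """True if the URL's host is the Danbooru domain or one of its subdomains."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        # Malformed URLs (e.g. broken IPv6 brackets) are treated as non-matches.
        return False
    return host == DANBOORU_DOMAIN or host.endswith("." + DANBOORU_DOMAIN)


def count_danbooru(urls):
    """Count matching URLs in any iterable, so the data can be streamed."""
    return sum(1 for url in urls if is_danbooru_url(url))
```

Matching on the parsed host rather than on the raw string avoids false positives where the domain name only appears in a path or file name.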
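>@kuronagirai: Note to self
>Association spells for Stable / Waifu diffusion
>Judging from LAION-5B, Stable probably learned Danbooru tags by way of Google search
>As a result, the Danbooru tags are all lumped together and recognized as one long word
>In the case of 潤羽るしあ (Uruha Rushia), the AI learned her not as 潤羽るしあ but in the form 1girl aqua_hair bun butterfly~
>@kuronagirai: By Danbooru's spec, tags are sorted in numeric and alphabetical order, so Stablediffusion learned them in that same order
>So even if you add or remove small pieces inside the association spell, the surrounding order fills in the gaps
>In human terms, if "大西洋" (Atlantic Ocean) is miswritten as 太西洋 or 大酉洋, you can still more or less tell what was meant
>But if it becomes 西洋大, the association is lost
>@kuronagirai: So when adding extra instructions, separate them with commas
>and it may be better to put them before the association spell
>Once a comma is there, it is treated as a separate word, so it should not be bound by the ordering constraint of the association spell
>Still, association spells are just too long, so using them to call up copyrighted characters is only a slight improvement...
>Without emoji compression spells I can't do anything...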
This looks hard to use
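The recipe in the tweets above can be sketched as follows: sort the tags, join them with spaces into one "spell", and put any extra instructions in front, separated by commas. This assumes Python's default string sort (digits before lowercase letters) matches the Danbooru order described in the tweets. `build_prompt` is a hypothetical helper, not part of any Stable Diffusion tool.

```python
def build_prompt(tags, extra_instructions=()):
    """Build a prompt: comma-separated extras first, then the sorted tag spell."""
    # Assumption: code-point order ("1girl" < "aqua_hair") mirrors Danbooru's sort.
    spell = " ".join(sorted(tags))
    parts = list(extra_instructions) + [spell]
    # Skip empty pieces so no stray commas appear.
    return ", ".join(part for part in parts if part)
```

For example, `build_prompt(["bun", "1girl", "butterfly", "aqua_hair"], ["smile"])` gives `smile, 1girl aqua_hair bun butterfly`.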
https://github.com/rom1504/clip-retrieval