DeepFloyd IF: an open-source large model from DeepFloyd Lab and StabilityAI

Introduction

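This article introduces DeepFloyd IF, a state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.

DeepFloyd IF is a modular model composed of a frozen text encoder and three cascaded pixel diffusion modules: a base model that generates a 64x64 px image from the text prompt, and two super-resolution models, each designed to generate images of increasing resolution: 256x256 px and 1024x1024 px.

All stages of the model use a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling.

The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and points to a promising future for text-to-image synthesis.

Inspired by "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding".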

Minimum requirements to use all IF models:

16 GB vRAM for IF-I-XL (4.3B text to 64x64 base module) and IF-II-L (1.2B to 256x256 upscaler module)
24 GB vRAM for IF-I-XL (4.3B text to 64x64 base module), IF-II-L (1.2B to 256x256 upscaler module) and Stable x4 (to 1024x1024 upscaler)
xformers, with the environment variable FORCE_MEM_EFFICIENT_ATTN=1 set
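
The requirements above can be expressed as a short Python sketch. It assumes the flag has to be set before deepfloyd_if is imported, which the source does not state, and stages_for_vram is a hypothetical helper that only encodes the 16 GB and 24 GB thresholds listed above.

```python
import os
from typing import List

# Assumption: the flag is read when deepfloyd_if is imported, so set it first.
os.environ["FORCE_MEM_EFFICIENT_ATTN"] = "1"


def stages_for_vram(vram_gb: float) -> List[str]:
    """Hypothetical helper: which stages fit, per the thresholds listed above."""
    if vram_gb >= 24:
        return ["IF-I-XL", "IF-II-L", "Stable x4"]
    if vram_gb >= 16:
        return ["IF-I-XL", "IF-II-L"]
    return []  # below the documented minimum


print(stages_for_vram(16))
```

The same variable can also be exported in the shell before starting Python.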

Quick Start

Open in Colab or on Hugging Face Spaces

pip install deepfloyd_if==1.0.2rc0
pip install xformers==0.0.16
pip install git+https://github.com/openai/CLIP.git --no-deps

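Local notebooks

Jupyter Notebook, Kaggle

The Dream, Style Transfer, Super Resolution and Inpainting modes are available in the Jupyter Notebook.

Integration with Diffusers

IF is also integrated with the Hugging Face Diffusers library.

Diffusers runs each stage individually, which lets the user customize the image generation process and easily inspect intermediate results.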

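Example

Before you can use IF, you need to accept its usage conditions. To do so:

Make sure you have a Hugging Face account and are logged in
Accept the license on the DeepFloyd/IF-I-XL-v1.0 model card
Make sure you are logged in locally

Install huggingface_hub: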

pip install huggingface_hub --upgrade

Run the login function in a Python shell:

from huggingface_hub import login

login()

and enter your Hugging Face Hub access token.
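
For scripts that cannot prompt interactively, the login step can be sketched as follows. The helper name login_from_env and the choice of the HF_TOKEN variable are assumptions made for illustration; login(token=...) is the huggingface_hub call used above with the token passed explicitly.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Hypothetical helper: log in with a token taken from an environment variable.

    Returns False, without contacting the Hub, when the variable is unset.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the helper can be defined without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling login_from_env() once at the start of a script takes the place of the interactive prompt.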

Next, install diffusers and its dependencies:

pip install diffusers accelerate transformers safetensors

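Now we can run the model locally.

By default, diffusers uses model CPU offloading to run the whole IF pipeline with as little as 14 GB of VRAM.

If you are using torch>=2.0.0, make sure to remove all enable_xformers_memory_efficient_attention() calls.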
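
Rather than deleting those calls by hand, the version rule above could be checked at run time. This is a minimal sketch; should_enable_xformers is a hypothetical helper, and it looks only at the major version number.

```python
def should_enable_xformers(torch_version: str) -> bool:
    """Return True when torch is older than 2.0.0, per the note above."""
    major = int(torch_version.split(".")[0])
    return major < 2


# Usage with a loaded pipeline (stage_1 is created in the example that follows):
# import torch
# if should_enable_xformers(torch.__version__):
#     stage_1.enable_xformers_memory_efficient_attention()
print(should_enable_xformers("1.13.1"), should_enable_xformers("2.0.1+cu118"))
```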

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

# stage 1

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_xformers_memory_efficient_attention()  # remove line if torch.__version__ >= 2.0.0
stage_1.enable_model_cpu_offload()

# stage 2

stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_xformers_memory_efficient_attention()  # remove line if torch.__version__ >= 2.0.0
stage_2.enable_model_cpu_offload()

# stage 3

safety_modules = {"feature_extractor": stage_1.feature_extractor, "safety_checker": stage_1.safety_checker, "watermarker": stage_1.watermarker}
stage_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16)
stage_3.enable_xformers_memory_efficient_attention()  # remove line if torch.__version__ >= 2.0.0
stage_3.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'

# text embeds

prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

generator = torch.manual_seed(0)

# stage 1

image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt").images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2

image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3

image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")

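There are several ways to speed up inference and lower memory consumption with diffusers. For details, see the diffusers documentation:

🚀 Optimizing for inference time

⚙️ Optimizing for low memory during inference

For more details on how to use IF, see the IF blog post and the documentation 📖.

Run the code locally

Loading the models into VRAM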

from deepfloyd_if.modules import IFStageI, IFStageII, StableStageIII
from deepfloyd_if.modules.t5 import T5Embedder
device = 'cuda:0'
if_I = IFStageI('IF-I-XL-v1.0', device=device)
if_II = IFStageII('IF-II-L-v1.0', device=device)
if_III = StableStageIII('stable-diffusion-x4-upscaler', device=device)
t5 = T5Embedder(device="cpu")

I. Dream

Dream is the text-to-image mode of the IF model.

from deepfloyd_if.pipelines import dream

prompt = 'ultra close-up color photo portrait of rainbow owl with deer horns in the woods'
count = 4

result = dream(
    t5=t5, if_I=if_I, if_II=if_II, if_III=if_III,
    prompt=[prompt]*count,
    seed=42,
    if_I_kwargs={
        "guidance_scale": 7.0,
        "sample_timestep_respacing": "smart100",
    },
    if_II_kwargs={
        "guidance_scale": 4.0,
        "sample_timestep_respacing": "smart50",
    },
    if_III_kwargs={
        "guidance_scale": 9.0,
        "noise_level": 20,
        "sample_timestep_respacing": "75",
    },
)

if_III.show(result['III'], size=14)

II. Zero-shot Image-to-Image Translation

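In Style Transfer mode, the output for your prompt is rendered in the style of support_pil_img. In the example below, raw_pil_image is a PIL image that you load yourself beforehand.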

from deepfloyd_if.pipelines import style_transfer

result = style_transfer(
    t5=t5, if_I=if_I, if_II=if_II,
    support_pil_img=raw_pil_image,
    style_prompt=[
        'in style of professional origami',
        'in style of oil art, Tate modern',
        'in style of plastic building bricks',
        'in style of classic anime from 1990',
    ],
    seed=42,
    if_I_kwargs={
        "guidance_scale": 10.0,
        "sample_timestep_respacing": "10,10,10,10,10,10,10,10,0,0",
        'support_noise_less_qsample_steps': 5,
    },
    if_II_kwargs={
        "guidance_scale": 4.0,
        "sample_timestep_respacing": 'smart50',
        "support_noise_less_qsample_steps": 5,
    },
)
if_I.show(result['II'], 1, 20)
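
The comma-separated sample_timestep_respacing strings above resemble the section-count format used in OpenAI's improved-diffusion codebase, where the schedule is split into equal sections and each number says how many timesteps to keep in that section. The sketch below reimplements that convention for illustration only. That IF parses these strings the same way, and that the full schedule has 1000 timesteps, are both assumptions; presets such as smart50 are not covered.

```python
from typing import List


def space_timesteps(num_timesteps: int, section_counts: str) -> List[int]:
    """Illustrative reimplementation: pick evenly spaced timesteps per section."""
    counts = [int(x) for x in section_counts.split(",")]
    size_per = num_timesteps // len(counts)
    extra = num_timesteps % len(counts)
    start_idx = 0
    all_steps = []
    for i, count in enumerate(counts):
        # Earlier sections absorb the remainder when the split is uneven.
        size = size_per + (1 if i < extra else 0)
        if size < count:
            raise ValueError(f"cannot take {count} steps from a section of {size}")
        stride = 1.0 if count <= 1 else (size - 1) / (count - 1)
        cur = 0.0
        for _ in range(count):
            all_steps.append(start_idx + round(cur))
            cur += stride
        start_idx += size
    return sorted(all_steps)


steps = space_timesteps(1000, "10,10,10,10,10,10,10,10,0,0")
print(len(steps), max(steps))
```

Under this reading, the trailing zeros drop the noisiest part of the schedule, which would be consistent with starting from a partially noised support image instead of pure noise.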

III. Super Resolution

For super-resolution, users can run IF-II and IF-III, or "Stable x4", on an image that was not necessarily generated by IF (two cascades):

from deepfloyd_if.pipelines import super_resolution

middle_res = super_resolution(
    t5,
    if_III=if_II,
    prompt=['woman with a blue headscarf and a blue sweater, detailed picture, 4k dslr, best quality'],
    support_pil_img=raw_pil_image,
    img_scale=4.,
    img_size=64,
    if_III_kwargs={
        'sample_timestep_respacing': 'smart100',
        'aug_level': 0.5,
        'guidance_scale': 6.0,
    },
)
high_res = super_resolution(
    t5,
    if_III=if_III,
    prompt=[''],
    support_pil_img=middle_res['III'][0],
    img_scale=4.,
    img_size=256,
    if_III_kwargs={
        "guidance_scale": 9.0,
        "noise_level": 20,
        "sample_timestep_respacing": "75",
    },
)
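# show_superres is not defined in this snippet; supply your own helper to display the two images side by side
show_superres(raw_pil_image, high_res['III'][0])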
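
The two calls above chain a 4x upscale twice. A minimal sketch of the resulting sizes follows; it assumes the output side length is simply img_size multiplied by img_scale, and cascade_sizes is a hypothetical helper, not part of deepfloyd_if.

```python
from typing import List


def cascade_sizes(base_size: int, scales: List[float]) -> List[int]:
    """Return the side length after each upscaling step, starting from base_size."""
    sizes = [base_size]
    for scale in scales:
        sizes.append(int(sizes[-1] * scale))
    return sizes


# img_size=64 with img_scale=4., then img_size=256 with img_scale=4.
print(cascade_sizes(64, [4.0, 4.0]))
```

The result matches the 64x64, 256x256 and 1024x1024 px resolutions described in the introduction.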

IV. Zero-shot Inpainting

from deepfloyd_if.pipelines import inpainting

result = inpainting(
    t5=t5, if_I=if_I,
    if_II=if_II,
    if_III=if_III,
    support_pil_img=raw_pil_image,
    inpainting_mask=inpainting_mask,
    prompt=[
        'oil art, a man in a hat',
    ],
    seed=42,
    if_I_kwargs={
        "guidance_scale": 7.0,
        "sample_timestep_respacing": "10,10,10,10,10,0,0,0,0,0",
        'support_noise_less_qsample_steps': 0,
    },
    if_II_kwargs={
        "guidance_scale": 4.0,
        'aug_level': 0.0,
        "sample_timestep_respacing": '100',
    },
    if_III_kwargs={
        "guidance_scale": 9.0,
        "noise_level": 20,
        "sample_timestep_respacing": "75",
    },
)

if_I.show(result['I'], 2, 3)
if_I.show(result['II'], 2, 6)
if_I.show(result['III'], 2, 14)

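Model Zoo

Links to the weights and the Model Card will soon be available for each model in the model zoo.

A Model Card is a document that records and shares information about a machine learning model.

Original models

* marks the best model

Quantitative evaluation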

FID = 6.66
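
For readers unfamiliar with the metric, FID (Fréchet Inception Distance) compares Inception feature statistics of real and generated images, and lower is better. The standard definition is given below; the symbols are introduced here and do not appear in the original: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the features of real and generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```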

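Acknowledgements

Special thanks to StabilityAI and its CEO Emad Mostaque for their invaluable support, providing GPU compute and infrastructure to train the models (our gratitude goes to Richard Vencu);

Special thanks to LAION and Christoph Schuhmann for their contribution to the project and for well-prepared datasets; thanks to the Huggingface team for optimizing the models' speed and memory consumption during inference, creating demos and giving cool advice!

🚀 External Contributors 🚀

Heartfelt thanks to @apolinário for ideas, consultations, help and support at every stage to make IF available as open source; for writing a lot of documentation and instructions; for creating a friendly atmosphere in difficult moments 🦉;

Thanks to @patrickvonplaten for improving the loading time of unet models by 80%; for integrating Stable-Diffusion-x4 as a native pipeline 💪;

Thanks to @williamberman and @patrickvonplaten for the diffusers integration 🙌;

Thanks to @hysts and @apolinário for creating the best gradio demo with IF 🚀;

Thanks to @Dango233 for adapting IF with xformers memory-efficient attention 💪;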
