AIGC在营销图片生成技术综述

阿里云国内75折回扣微信号：monov8

阿里云国际，腾讯云国际，低至75折。AWS 93折免费开户实名账号代冲值优惠多多微信号：monov8 飞机：@monov6

基于文本生成素材

imagen

分析用户输入的文本并使用T5-XXL进行编码。嵌入在 AI 中的文本首先被转换为分辨率为64x64像素的小图像。Imagen进一步利用文本条件超分辨率扩散模型对图像进行64×64的上采样然后这个图像继续增长并最终形成。

Imagen 的开发者谷歌研究的大脑团队表示基于变压器和图像扩散模型Imagen实现了前所未有的真实感。谷歌声称对比其它模型在图像保真度和图像-文本匹配方面人类评估者更喜欢 Imagen。

不过谷歌也表示Imagen 是在从网络上抓取的数据集上进行训练的虽然已经过滤了很多不良内容如色情图像、污秽语言等但仍有大量不当的内容数据集因此也会存在种族主义诽谤和有害的社会刻板印象。

StableDiffusion

从名字Stable Diffusion就可以看出这个主要采用的扩散模型Diffusion Model。

简单来说扩散模型就是去噪自编码器的连续应用逐步生成图像的过程。

一般所言的扩散是反复在图像中添加小的、随机的噪声。而扩散模型则与这个过程相反——将噪声生成高清图像。训练的神经网络通常为U-net。

不过因为模型是直接在像素空间运行导致扩散模型的训练、计算成本十分昂贵。

基于这样的背景下Stable Diffusion主要分两步进行。

首先使用编码器将图像x压缩为较低维的潜在空间表示zx。

其中上下文Contexty即输入的文本提示用来指导x的去噪。

它与时间步长t一起以简单连接和交叉两种方式注入到潜在空间表示中去。

随后在zx基础上进行扩散与去噪。换言之就是模型并不直接在图像上进行计算从而减少了训练时间、效果更好。

值得一提的是Stable DIffusion的上下文机制非常灵活y不光可以是图像标签就是蒙版图像、场景分割、空间布局也能够相应完成。

素材多样化生成

基于文本ps多样化

DreamBooth

用户可以给定3-5张自己随意拍摄的某一物体的图片就能得到不同背景下的该物体的新颖再现同时又保留了其关键特征。

当然作者也表示这种方法并不局限于某个模型如果DALL·E2经过一些调整同样能实现这样的功能。

具体到方法上DreamBooth采用了给物体加上“特殊标识符”的方法。

也就是说原本图像生成模型收到的指令只是一类物体例如[cat]、[dog]等但现在DreamBooth会在这类物体前加上一个特殊标识符变成[V][物体类别]。

以下图为例将用户上传的三张狗子照片和相应的类名如“狗”作为输入信息得到一个经过微调的文本-图像扩散模型。

该扩散模型用“a [V] dog”来特指用户上传图片中的狗子再把其带入文字描述中生成特定的图像其中[V]就是那个特殊标识符。

至于为什么不直接用[V]来指代整个[特定物体]

作者表示受限于输入照片的数量模型无法很好地学习到照片中物体的整体特征反而可能出现过拟合。

因此这里采用了微调的思路整体上仍然基于AI已经学到的[物体类别]特征再用[V]学到的特殊特征来修饰它。

以生成一只白色的狗为例这里模型会通过[V]来学习狗的颜色白色、体型等个性化细节加上模型在[狗]这个大的类别中学到的狗的共性就能生成更多合理又不失个性的白狗的照片。

为了训练这个微调的文本-图像扩散模型研究人员首先根据给定的文本描述生成低分辨率图像这时生成的图像中狗子的形象是随机的。

然后再应用超分辨率的扩散模型进行替换把随机图像换成用户上传的特定狗子。

instructPix2Pix

InstructPix2Pix整合了目前较为成熟的两个大规模预训练模型语言模型GPT-3和文本图像生成模型Stable Diffusion生成了一个专用于图像编辑训练的数据集随后训练了一个条件引导型的扩散模型来完成这一任务。此外InstructPix2Pix模型可以在几秒钟内快速完成图像编辑操作这进一步提高了InstructPix2Pix的可用性和实用性。

InstructPix2Pix模型的整体构建流程分为两大部分1生成一个专用于图像编辑任务的数据集。2使用生成的数据集训练一个条件扩散模型该模型可以按照人类的指令对目标图像进行各种形式的编辑操作例如替换物体、更改图像本身的风格、修改图像的背景环境等等。

作者在InstructPix2Pix中整合了两个大规模预训练模型语言模型GPT-3和文本图像模型Stable Diffusion同时利用这两个模型中蕴含的知识构建了一个多模态训练数据集该数据集主要包含了由文本编辑指令和编辑前后对应图像构成的图像对。在构建过程中作者首先从文本编辑指令出发生成成对的图像描述。随后再根据这些描述生成对应的成对图像构成训练样本。

1. 生成成对的图像描述

在这一过程中需要先给定一个图像文本描述例如“一个女孩骑马的照片”如上图a中所示随后需要根据该文本描述生成一些合理的编辑指令例如“让一个女孩骑龙”更合理一点的描述为“一个女孩骑龙的照片”这一操作可以通过GPT-3类似的文本大模型完成。需要注意的是这些操作完全在文本域中进行这样做可以生成大量的、多样性的编辑指令同时能够保证图像变化和文本指令之间的对应关系。

具体来说作者对GPT-3进行了专门的微调首先收集了一个规模相对较小的人工编辑三元组数据集三元组包含1输入的图像描述2编辑指令3输出的图像描述数据集详细介绍如下表所示。

首先收集了700条图像描述样本然后手动编写了编辑指令和输出图像描述然后使用这700条样本对GPT-3模型进行微调微调后的模型可以自行生成详细的训练样本上表非常鲜明的展示了作者手动生成的样本和GPT-3随后生成样本的对比。

在得到成对的编辑指令后作者使用文本图像模型Stable Diffusion将这两个文本提示即编辑前和编辑后转换为一对相应的图像如上图b所示。然而这一过程仍然面临一个重大挑战目前的文本到图像模型无法保证图像内容身份信息的一致性即使在输入的条件提示变化非常小的情况下。

例如我们为模型指定两个非常相似的文本提示“一张猫的照片”和“一张黑猫的照片”模型可能会产生两只截然不同的猫的图像这对本文图像编辑的目的来说是不合理的。为了解决这一问题作者想到使用这些成对数据来训练模型编辑图像而不是遵循这些模型原本的生成模式去生成随机图像。

作者使用了最近新提出的Prompt-to-Prompt方法[3]来完成操作该方法可以针对一个输入文本生成多代近似的图像且这些图像彼此之间含有相同的身份信息Prompt-to-Prompt通过在去噪过程中使用交互注意力权重来实现。

InstructPix2Pix的建模本质是从隐空间扩散模型Latent Diffusion演变而来Latent Diffusion通过在带有编码器和解码器的预训练变分自动编码器的隐空间中运行来提高扩散模型的效率和质量。对于一个图像扩散过程将噪声添加到编码的隐层向量中产生一个噪声隐变量其中噪声等级随时间步数而增加。然后训练一个网络它可以预测在给定的图像条件和文本指令条件下添加到噪声隐变量中的噪声信息然后通过以下目标函数来优化模型

之前的工作[4]表明微调大型图像扩散模型往往比从头训练模型以完成图像翻译任务效果更好尤其是在配对训练数据有限的情况下。因此本文作者使用预训练的Stable Diffusion对模型进行初始化。为了赋予InstructPix2Pix图像编辑的能力作者在模型的第一个卷积层中增加了额外的条件输入通道。

为了进一步提高图像生成效果以及模型对输入条件的遵循程度作者在InstructPix2Pix中也引入了Classifier-free引导策略。Classifier-free扩散引导是一种权衡扩散模型生成的样本质量和多样性的方法。其中隐式分类器会将更高的可能性分配给条件以提高生成图像的视觉质量并使采样图像更好地与输入条件相符合。Classifier-free引导的训练需要同时联合训练有条件和无条件去噪的扩散模型并在推理时结合两个分数进行估计。

对于本文的任务作者设计了一个评分网络其中有两个条件输入图像和文本指令。在训练过程中使InstructPix2Pix能够针对两个或任一条件输入进行有条件或无条件去噪。为此作者引入了两个指导尺度和可以对其进行调整以权衡生成的样本与输入图像的遵循程度以及它们与编辑指令的遵循程度评分网络的分数估计如下

在下图的“将大卫变成半机械人”的例子中显示了这两个参数对生成样本的影响。控制与输入图像的相似性而控制与编辑指令的一致性。

基于3d建模多样化

单图生成3d模型

Point-E

Point-E 不输出传统意义上的 3D 图像它会生成点云或空间中代表 3D 形状的离散数据点集。Point-E 中的 E 是「效率」的缩写表示其比以前的 3D 对象生成方法更快。不过从计算的角度来看点云更容易合成但它们无法捕获对象的细粒度形状或纹理 —— 这是目前 Point-E 的一个关键限制。

为了解决这一问题OpenAI 团队训练了一个额外的人工智能系统来将 Point-E 的点云转换为网格。

Point-E 架构及运行原理

在独立的网格生成模型之外Point-E 主要由两个模型组成文本到图像模型和图像到 3D 模型。文本到图像模型类似于 OpenAI 自家的 DALL-E 2 和 Stable Diffusion 等生成模型系统在标记图像上进行训练以理解单词和视觉概念之间的关联。在图像生成之后图像到 3D 模型被输入一组与 3D 对象配对的图像训练出在两者之间有效转换的能力。

不是训练单个生成模型直接生成以文本为条件的点云而是将生成过程分为三个步骤。

首先生成一个以文本标题为条件的综合视图。

接下来生成⼀个基于合成视图的粗略点云1,024 个点。

最后生成了⼀个以低分辨率点云和合成视图为条件的精细点云4,096 个点。

在数百万个3D模型上训练模型后我们发现数据集的数据格式和质量差异很大这促使我们开发各种后处理步骤以确保更高的数据质量。

为了将所有的数据转换为⼀种通用格式我们使用Blender从20个随机摄像机角度将每个3D模型渲染为RGBAD图像Blender支持多种3D格式并带有优化的渲染引擎。

对于每个模型Blender脚本都将模型标准化为边界立方体配置标准照明设置最后使用Blender的内置实时渲染引擎导出RGBAD图像。

然后使用渲染将每个对象转换为彩色点云。首先通过计算每个RGBAD图像中每个像素的点来为每个对象构建⼀个密集点云。这些点云通常包含数十万个不均匀分布的点因此我们还使用最远点采样来创建均匀的4K点云。

通过直接从渲染构建点云我们能够避免直接从3D网格中采样可能出现的各种问题对模型中包含的点进行取样或处理以不寻常的文件格式存储的三维模型。

最后我们采用各种启发式方法来减少数据集中低质量模型的频率。

首先我们通过计算每个点云的SVD来消除平面对象只保留那些最小奇异值高于某个阈值的对象。

接下来我们通过CLIP特征对数据集进行聚类对于每个对象我们对所有渲染的特征进行平均。

我们发现一些集群包含许多低质量的模型类别而其他集群则显得更加多样化或可解释。

我们将这些集群分到几个不同质量的bucket中并使用所得bucket的加权混合作为我们的最终数据集。

多图合成3d模型

Nerf

NeRF 作为 ECCV2020 Best Paper Honorable Mention影响力巨大。如今各大CV&CG会议中都是此类工作又一个大坑。这个 method 使用隐式表达以 2d posed images 为监督完成 novel view synthesis。通俗来说就是用多张 2d 图片隐式重建三维场景其展示的生成效果让人十分震撼。

希望本文能让大家快速了解这项工作~

原文地址 NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
项目主页 NeRF: Neural Radiance Fields

pipeline

NeRF训练管线

首先NeRF的想法是将三维场景隐式存储在神经网络中我们只需要通过输入一个相机位姿就可以获得场景图片。NeRF将场景建模成一个连续的5D辐射场其实感觉就可以理解为隐式的体素描述。

已知场景中点的位置 (x,y,z) 和观察方向 (θ,ϕ) 神经网络 FΘ 会输出一个 (c,σ) 表示该方向的自发光颜色 c 和该点体素密度 σ 。然后使用 classical volume rendering 的渲染方程

C(r)=∫tntfT(t)σ(r(t))c(r(t),d)dt, where T(t)=exp⁡(−∫tntσ(r(s))ds)

The expected color C(r) of camera ray r(t) = o + td with near and far bounds tn and tf
The function T(t) denotes the accumulated transmittance along the ray from tn to t, i.e., the probability that the ray travels from tn to t without hitting any other particle.

我们就可以得到输入相机位姿条件下的视角图片然后和 ground truth 做损失即可完成可微优化。

其实思路总结下来十分简单

用 network 存体素信息(x,y,z,θ,ϕ)→(c,σ)

然后用体素渲染方程获得生成视角图片光线采样+积分

最后与原视角图片计算损失更新网络

需要注意的是体素在不同方向的自发光颜色是不一样的view-dependent具体效果参考原文 Fig.3 和 Fig.4 。view-dependent 可以表示各向异性的光学属性。

See Fig. 3 for an example of how our method uses the input viewing direction to represent non-Lambertian effects. As shown in Fig. 4, a model trained without view dependence (only x as input) has difficulty representing specularities.
一点疑问如果体素密度 view-dependent 会怎么样

Numerical estimation of volume rendering

在上一节我们介绍了NeRF生成图片的规则。注意到体素渲染方程是连续形式而在实际神经网络训练过程中无法完成的。因此我们需要将其改为离散形式进行近似计算Quadrature积分法

C^(r)=∑i=1NTi(1−exp⁡(−σiδi))ci, where Ti=exp⁡(−∑j=1i−1σjδj)

其中

ti∼U[tn+i−1N(tf−tn),tn+iN(tf−tn)]

δi=ti+1−ti

通过将视线路径均分成 N 段然后在每一段均匀地随机采样体素用于渲染计算。

文末给出了离散形式的证明

Hierarchical volume sampling

虽然前文我们使用离散的近似积分来进行体素渲染但是在场景中难免会存在 free space 和 occluded regions 这种应该对渲染结果无贡献的区间或者说没有意义的。为了更有效地采样本文采用分级表征渲染的思想[1]提高渲染效率即通过同时优化两个神经网络"coarse"和"fine"。

作者先使用分层采样得到 Nc 个点通过 coarse 的渲染方程的计算

C^c(r)=∑i=1Ncwici,wi=Ti(1−exp⁡(−σiδi))

对 ωi 进行归一化 w^i=wi/∑j=1Ncwj 得到分段常数概率密度函数然后通过逆变换采样inverse transform sampling获得 Nf 个点添加至原 Nc 个点中用于 fine 渲染。

逆变换采样在分布 p 的 CDF 值域上均匀采样与原分布 p 中的采样同分布

通过第二次采样我们所得到的采样点则会更多得使用对颜色计算贡献更有意义的体素进行计算。

最后的损失函数如下

L=∑r∈R[‖C^c(r)−C(r)‖22+‖C^f(r)−C(r)‖22]

Positional encoding

虽然前文描述的“隐式表示+体素渲染”十分美好但是我们通过下图No Position Encoding可知它的生成图片十分模糊可以理解为一种高频信息丢失。前人工作[2]指出神经网络倾向于学习低频信息而NeRF需要重建高清的场景对场景overfitting所以我们需要让模型关注场景的高频细节。于是我们将位置向量 x 和方向向量 d 转化为高频变量

γ(p)=(sin⁡(20πp),cos⁡(20πp),⋯,sin⁡(2L−1πp),cos⁡(2L−1πp))

可以理解为这种表示方法即便两个点在原空间中的距离很近很难分辨但通过Positional encoding后我们还是可以很轻松的分辨两个点

In our experiments, we set L = 10 for γ(x) and L = 4 for γ(d).

然后将Positional encoding后的 (x,y,z) 和 (θ,ϕ) 作为输入就可以生成更加清晰的图片。

合图多样化

Demo

文本生成图片

from diffusers import StableDiffusionPipeline
import torch

model_id = "dreamlike-art/dreamlike-photoreal-2.0"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "blender 3d model, a stand full body cute fluen ant doll with blue and white eyes of technology style,bright cinematic lighting, gopro, fisheye lens  closeup,highly detailed, digital painting, artstation, concept art, smooth,  volumetric light"
image = pipe(prompt).images[0]

image.save("./result.jpg")

" a ant doll , blue and white eyes,technology style "

prompt文本增加多样性

import argparse
import hashlib
import itertools
import math
import os
from pathlib import Path
from typing import Optional

import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.utils.data import Dataset

from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from diffusers import AutoencoderKL, DDPMScheduler, StableDiffusionPipeline, UNet2DConditionModel
from diffusers.optimization import get_scheduler
from huggingface_hub import HfFolder, Repository, whoami
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizer


logger = get_logger(__name__)


def parse_args(input_args=None):
    parser = argparse.ArgumentParser(description="Simple example of a training script.")
    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        default=None,
        required=True,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--revision",
        type=str,
        default=None,
        required=False,
        help="Revision of pretrained model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--tokenizer_name",
        type=str,
        default=None,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--instance_data_dir",
        type=str,
        default=None,
        required=True,
        help="A folder containing the training data of instance images.",
    )
    parser.add_argument(
        "--class_data_dir",
        type=str,
        default=None,
        required=False,
        help="A folder containing the training data of class images.",
    )
    parser.add_argument(
        "--instance_prompt",
        type=str,
        default=None,
        help="The prompt with identifier specifying the instance",
    )
    parser.add_argument(
        "--class_prompt",
        type=str,
        default=None,
        help="The prompt to specify images in the same class as provided instance images.",
    )
    parser.add_argument(
        "--with_prior_preservation",
        default=False,
        action="store_true",
        help="Flag to add prior preservation loss.",
    )
    parser.add_argument("--prior_loss_weight", type=float, default=1.0, help="The weight of prior preservation loss.")
    parser.add_argument(
        "--num_class_images",
        type=int,
        default=100,
        help=(
            "Minimal class images for prior preservation loss. If not have enough images, additional images will be"
            " sampled with class_prompt."
        ),
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="text-inversion-model",
        help="The output directory where the model predictions and checkpoints will be written.",
    )
    parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
    parser.add_argument(
        "--resolution",
        type=int,
        default=512,
        help=(
            "The resolution for input images, all the images in the train/validation dataset will be resized to this"
            " resolution"
        ),
    )
    parser.add_argument(
        "--center_crop", action="store_true", help="Whether to center crop images before resizing to resolution"
    )
    parser.add_argument(
        "--use_filename_as_label", action="store_true", help="Uses the filename as the image labels instead of the instance_prompt, useful for regularization when training for styles with wide image variance"
    )
    parser.add_argument(
        "--use_txt_as_label", action="store_true", help="Uses the filename.txt file's content as the image labels instead of the instance_prompt, useful for regularization when training for styles with wide image variance"
    )
    parser.add_argument("--train_text_encoder", action="store_true", help="Whether to train the text encoder")
    parser.add_argument(
        "--train_batch_size", type=int, default=4, help="Batch size (per device) for the training dataloader."
    )
    parser.add_argument(
        "--sample_batch_size", type=int, default=4, help="Batch size (per device) for sampling images."
    )
    parser.add_argument("--num_train_epochs", type=int, default=1)
    parser.add_argument(
        "--max_train_steps",
        type=int,
        default=None,
        help="Total number of training steps to perform.  If provided, overrides num_train_epochs.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--gradient_checkpointing",
        action="store_true",
        help="Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=5e-6,
        help="Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument(
        "--scale_lr",
        action="store_true",
        default=False,
        help="Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size.",
    )
    parser.add_argument(
        "--lr_scheduler",
        type=str,
        default="constant",
        help=(
            'The scheduler type to use. Choose between ["linear", "cosine", "cosine_with_restarts", "polynomial",'
            ' "constant", "constant_with_warmup"]'
        ),
    )
    parser.add_argument(
        "--lr_warmup_steps", type=int, default=500, help="Number of steps for the warmup in the lr scheduler."
    )
    parser.add_argument(
        "--use_8bit_adam", action="store_true", help="Whether or not to use 8-bit Adam from bitsandbytes."
    )
    parser.add_argument("--adam_beta1", type=float, default=0.9, help="The beta1 parameter for the Adam optimizer.")
    parser.add_argument("--adam_beta2", type=float, default=0.999, help="The beta2 parameter for the Adam optimizer.")
    parser.add_argument("--adam_weight_decay", type=float, default=1e-2, help="Weight decay to use.")
    parser.add_argument("--adam_epsilon", type=float, default=1e-08, help="Epsilon value for the Adam optimizer")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.")
    parser.add_argument("--hub_token", type=str, default=None, help="The token to use to push to the Model Hub.")
    parser.add_argument(
        "--hub_model_id",
        type=str,
        default=None,
        help="The name of the repository to keep in sync with the local `output_dir`.",
    )
    parser.add_argument(
        "--logging_dir",
        type=str,
        default="logs",
        help=(
            "[TensorBoard](https://www.tensorflow.org/tensorboard) log directory. Will default to"
            " *output_dir/runs/**CURRENT_DATETIME_HOSTNAME***."
        ),
    )
    parser.add_argument(
        "--log_with",
        type=str,
        default="tensorboard",
        choices=["tensorboard", "wandb"]
    )
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default="no",
        choices=["no", "fp16", "bf16"],
        help=(
            "Whether to use mixed precision. Choose"
            "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
            "and an Nvidia Ampere GPU."
        ),
    )
    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--save_model_every_n_steps", type=int)
    parser.add_argument("--auto_test_model", action="store_true", help="Whether or not to automatically test the model after saving it")
    parser.add_argument("--test_prompt", type=str, default="A photo of a cat", help="The prompt to use for testing the model.")
    parser.add_argument("--test_prompts_file", type=str, default=None, help="The file containing the prompts to use for testing the model.example: test_prompts.txt, each line is a prompt")
    parser.add_argument("--test_negative_prompt", type=str, default="", help="The negative prompt to use for testing the model.")
    parser.add_argument("--test_seed", type=int, default=42, help="The seed to use for testing the model.")
    parser.add_argument("--test_num_per_prompt", type=int, default=1, help="The number of images to generate per prompt.")
    
    if input_args is not None:
        args = parser.parse_args(input_args)
    else:
        args = parser.parse_args()

    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    if env_local_rank != -1 and env_local_rank != args.local_rank:
        args.local_rank = env_local_rank

    if args.instance_data_dir is None:
        raise ValueError("You must specify a train data directory.")

    if args.with_prior_preservation:
        if args.class_data_dir is None:
            raise ValueError("You must specify a data directory for class images.")
        if args.class_prompt is None:
            raise ValueError("You must specify prompt for class images.")

    return args

# turns a path into a filename without the extension
def get_filename(path):
    return path.stem

def get_label_from_txt(path):
    txt_path = path.with_suffix(".txt") # get the path to the .txt file
    if txt_path.exists():
        with open(txt_path, "r") as f:
            return f.read()
    else:
        return ""

class DreamBoothDataset(Dataset):
    """
    A dataset to prepare the instance and class images with the prompts for fine-tuning the model.
    It pre-processes the images and the tokenizes prompts.
    """

    def __init__(
        self,
        instance_data_root,
        instance_prompt,
        tokenizer,
        class_data_root=None,
        class_prompt=None,
        size=512,
        center_crop=False,
        use_filename_as_label=False,
        use_txt_as_label=False,
    ):
        self.size = size
        self.center_crop = center_crop
        self.tokenizer = tokenizer

        self.instance_data_root = Path(instance_data_root)
        if not self.instance_data_root.exists():
            raise ValueError("Instance images root doesn't exists.")

        self.instance_images_path = list(self.instance_data_root.glob("*.jpg")) + list(self.instance_data_root.glob("*.png"))
        self.num_instance_images = len(self.instance_images_path)
        self.instance_prompt = instance_prompt
        self.use_filename_as_label = use_filename_as_label
        self.use_txt_as_label = use_txt_as_label
        self._length = self.num_instance_images

        if class_data_root is not None:
            self.class_data_root = Path(class_data_root)
            self.class_data_root.mkdir(parents=True, exist_ok=True)
            self.class_images_path = list(self.class_data_root.glob("*.jpg")) + list(self.class_data_root.glob("*.png"))
            self.num_class_images = len(self.class_images_path)
            self._length = max(self.num_class_images, self.num_instance_images)
            self.class_prompt = class_prompt
        else:
            self.class_data_root = None

        self.image_transforms = transforms.Compose(
            [
                transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR),
                transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ]
        )

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        example = {}
        path = self.instance_images_path[index % self.num_instance_images]
        prompt = get_filename(path) if self.use_filename_as_label else self.instance_prompt
        prompt = get_label_from_txt(path) if self.use_txt_as_label else prompt
        
        print("prompt", prompt)
        
        instance_image = Image.open(path)
        if not instance_image.mode == "RGB":
            instance_image = instance_image.convert("RGB")
        example["instance_images"] = self.image_transforms(instance_image)
        example["instance_prompt_ids"] = self.tokenizer(
            prompt,
            padding="do_not_pad",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
        ).input_ids

        if self.class_data_root:
            class_image = Image.open(self.class_images_path[index % self.num_class_images])
            if not class_image.mode == "RGB":
                class_image = class_image.convert("RGB")
            example["class_images"] = self.image_transforms(class_image)
            example["class_prompt_ids"] = self.tokenizer(
                self.class_prompt,
                padding="do_not_pad",
                truncation=True,
                max_length=self.tokenizer.model_max_length,
            ).input_ids

        return example


class PromptDataset(Dataset):
    "A simple dataset to prepare the prompts to generate class images on multiple GPUs."

    def __init__(self, prompt, num_samples):
        self.prompt = prompt
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        example = {}
        example["prompt"] = self.prompt
        example["index"] = index
        return example


def get_full_repo_name(model_id: str, organization: Optional[str] = None, token: Optional[str] = None):
    if token is None:
        token = HfFolder.get_token()
    if organization is None:
        username = whoami(token)["name"]
        return f"{username}/{model_id}"
    else:
        return f"{organization}/{model_id}"

def test_model(folder, args):
    if args.test_prompts_file is not None:
        with open(args.test_prompts_file, "r") as f:
            prompts = f.read().splitlines()
    else:
        prompts = [args.test_prompt]
    
    test_path = os.path.join(folder, "test")
    if not os.path.exists(test_path):
        os.makedirs(test_path)
    
    print("Testing the model...")
    from diffusers import DDIMScheduler
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    torch_dtype = torch.float16 if device.type == "cuda" else torch.float32
    pipeline = StableDiffusionPipeline.from_pretrained(
        folder,
        torch_dtype=torch_dtype,
        safety_checker=None,
        load_in_8bit=True,
        scheduler = DDIMScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear",
            clip_sample=False,
            set_alpha_to_one=False,
        ),
    )
    pipeline.set_progress_bar_config(disable=True)
    pipeline.enable_attention_slicing()
    pipeline = pipeline.to(device)

    torch.manual_seed(args.test_seed)
    with torch.autocast('cuda'): 
        for prompt in prompts:
            print(f"Generating test images for prompt: {prompt}")
            test_images = pipeline(
                prompt=prompt,
                width=512,
                height=512,
                negative_prompt=args.test_negative_prompt,
                num_inference_steps=30, 
                num_images_per_prompt=args.test_num_per_prompt,
            ).images
            
            for index, image in enumerate(test_images):
                image.save(f"{test_path}/{prompt}_{index}.png")
    
    del pipeline
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        
    print(f"Test completed.The examples are saved in {test_path}")


def save_model(accelerator, unet, text_encoder, args, step=None):
    unet = accelerator.unwrap_model(unet) 
    text_encoder = accelerator.unwrap_model(text_encoder)

    if step == None:
        folder = args.output_dir
    else:
        folder = args.output_dir + "-Step-" + str(step)

    print("Saving Model Checkpoint...")
    print("Directory: " + folder)

    # Create the pipeline using using the trained modules and save it.
    if accelerator.is_main_process:
        pipeline = StableDiffusionPipeline.from_pretrained(
            args.pretrained_model_name_or_path,
            unet=unet,
            text_encoder=text_encoder,
            revision=args.revision,
        )
        pipeline.save_pretrained(folder)
        del pipeline
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        if args.auto_test_model:
            print("Testing Model...")
            test_model(folder, args)

        if args.push_to_hub:
            repo.push_to_hub(commit_message="End of training", blocking=False, auto_lfs_prune=True)


def main(args):
    logging_dir = Path(args.logging_dir)

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.log_with,
        logging_dir=logging_dir,
    )


    # Currently, it's not possible to do gradient accumulation when training two models with accelerate.accumulate
    # This will be enabled soon in accelerate. For now, we don't allow gradient accumulation when training two models.
    # TODO (patil-suraj): Remove this check when gradient accumulation with two models is enabled in accelerate.
    if args.train_text_encoder and args.gradient_accumulation_steps > 1 and accelerator.num_processes > 1:
        raise ValueError(
            "Gradient accumulation is not supported when training the text encoder in distributed training. "
            "Please set gradient_accumulation_steps to 1. This feature will be supported in the future."
        )

    if args.seed is not None:
        set_seed(args.seed)

    if args.with_prior_preservation:
        class_images_dir = Path(args.class_data_dir)
        if not class_images_dir.exists():
            class_images_dir.mkdir(parents=True)
        cur_class_images = len(list(class_images_dir.iterdir()))

        if cur_class_images < args.num_class_images:
            torch_dtype = torch.float16 if accelerator.device.type == "cuda" else torch.float32
            pipeline = StableDiffusionPipeline.from_pretrained(
                args.pretrained_model_name_or_path,
                torch_dtype=torch_dtype,
                safety_checker=None,
                revision=args.revision,
            )
            pipeline.set_progress_bar_config(disable=True)

            num_new_images = args.num_class_images - cur_class_images
            logger.info(f"Number of class images to sample: {num_new_images}.")

            sample_dataset = PromptDataset(args.class_prompt, num_new_images)
            sample_dataloader = torch.utils.data.DataLoader(sample_dataset, batch_size=args.sample_batch_size)

            sample_dataloader = accelerator.prepare(sample_dataloader)
            pipeline.to(accelerator.device)

            for example in tqdm(
                sample_dataloader, desc="Generating class images", disable=not accelerator.is_local_main_process
            ):
                images = pipeline(example["prompt"]).images

                for i, image in enumerate(images):
                    hash_image = hashlib.sha1(image.tobytes()).hexdigest()
                    image_filename = class_images_dir / f"{example['index'][i] + cur_class_images}-{hash_image}.jpg"
                    image.save(image_filename)

            del pipeline
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

    # Handle the repository creation
    if accelerator.is_main_process:
        if args.push_to_hub:
            if args.hub_model_id is None:
                repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token)
            else:
                repo_name = args.hub_model_id
            repo = Repository(args.output_dir, clone_from=repo_name)

            with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore:
                if "step_*" not in gitignore:
                    gitignore.write("step_*\n")
                if "epoch_*" not in gitignore:
                    gitignore.write("epoch_*\n")
        elif args.output_dir is not None:
            os.makedirs(args.output_dir, exist_ok=True)

    # Load the tokenizer
    if args.tokenizer_name:
        tokenizer = CLIPTokenizer.from_pretrained(
            args.tokenizer_name,
            revision=args.revision,
        )
    elif args.pretrained_model_name_or_path:
        tokenizer = CLIPTokenizer.from_pretrained(
            args.pretrained_model_name_or_path,
            subfolder="tokenizer",
            revision=args.revision,
        )

    # Load models and create wrapper for stable diffusion
    text_encoder = CLIPTextModel.from_pretrained(
        args.pretrained_model_name_or_path,
        subfolder="text_encoder",
        revision=args.revision,
    )
    vae = AutoencoderKL.from_pretrained(
        args.pretrained_model_name_or_path,
        subfolder="vae",
        revision=args.revision,
    )
    unet = UNet2DConditionModel.from_pretrained(
        args.pretrained_model_name_or_path,
        subfolder="unet",
        revision=args.revision,
    )

    vae.requires_grad_(False)
    if not args.train_text_encoder:
        text_encoder.requires_grad_(False)

    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()
        if args.train_text_encoder:
            text_encoder.gradient_checkpointing_enable()

    if args.scale_lr:
        args.learning_rate = (
            args.learning_rate * args.gradient_accumulation_steps * args.train_batch_size * accelerator.num_processes
        )

    # Use 8-bit Adam for lower memory usage or to fine-tune the model in 16GB GPUs
    if args.use_8bit_adam:
        try:
            import bitsandbytes as bnb
        except ImportError:
            raise ImportError(
                "To use 8-bit Adam, please install the bitsandbytes library: `pip install bitsandbytes`."
            )

        optimizer_class = bnb.optim.AdamW8bit
    else:
        optimizer_class = torch.optim.AdamW

    params_to_optimize = (
        itertools.chain(unet.parameters(), text_encoder.parameters()) if args.train_text_encoder else unet.parameters()
    )
    optimizer = optimizer_class(
        params_to_optimize,
        lr=args.learning_rate,
        betas=(args.adam_beta1, args.adam_beta2),
        weight_decay=args.adam_weight_decay,
        eps=args.adam_epsilon,
    )

    noise_scheduler = DDPMScheduler.from_config(args.pretrained_model_name_or_path, subfolder="scheduler")

    train_dataset = DreamBoothDataset(
        instance_data_root=args.instance_data_dir,
        instance_prompt=args.instance_prompt,
        class_data_root=args.class_data_dir if args.with_prior_preservation else None,
        class_prompt=args.class_prompt,
        tokenizer=tokenizer,
        size=args.resolution,
        center_crop=args.center_crop,
        use_filename_as_label=args.use_filename_as_label,
        use_txt_as_label=args.use_txt_as_label,
    )

    def collate_fn(examples):
        input_ids = [example["instance_prompt_ids"] for example in examples]
        pixel_values = [example["instance_images"] for example in examples]

        # Concat class and instance examples for prior preservation.
        # We do this to avoid doing two forward passes.
        if args.with_prior_preservation:
            input_ids += [example["class_prompt_ids"] for example in examples]
            pixel_values += [example["class_images"] for example in examples]

        pixel_values = torch.stack(pixel_values)
        pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()

        input_ids = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt").input_ids

        batch = {
            "input_ids": input_ids,
            "pixel_values": pixel_values,
        }
        return batch

    train_dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.train_batch_size, shuffle=True, collate_fn=collate_fn, num_workers=1
    )

    # Scheduler and math around the number of training steps.
    overrode_max_train_steps = False
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    if args.max_train_steps is None:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
        overrode_max_train_steps = True

    lr_scheduler = get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_steps * args.gradient_accumulation_steps,
        num_training_steps=args.max_train_steps * args.gradient_accumulation_steps,
    )

    if args.train_text_encoder:
        unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
            unet, text_encoder, optimizer, train_dataloader, lr_scheduler
        )
    else:
        unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
            unet, optimizer, train_dataloader, lr_scheduler
        )

    weight_dtype = torch.float32
    if args.mixed_precision == "fp16":
        weight_dtype = torch.float16
    elif args.mixed_precision == "bf16":
        weight_dtype = torch.bfloat16

    # Move text_encode and vae to gpu.
    # For mixed precision training we cast the text_encoder and vae weights to half-precision
    # as these models are only used for inference, keeping weights in full precision is not required.
    vae.to(accelerator.device, dtype=weight_dtype)
    if not args.train_text_encoder:
        text_encoder.to(accelerator.device, dtype=weight_dtype)

    # We need to recalculate our total training steps as the size of the training dataloader may have changed.
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    if overrode_max_train_steps:
        args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
    # Afterwards we recalculate our number of training epochs
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

    # We need to initialize the trackers we use, and also store our configuration.
    # The trackers initializes automatically on the main process.
    if accelerator.is_main_process:
        accelerator.init_trackers("dreambooth", config=vars(args))

    # Train!
    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps

    logger.info("***** Running training *****")
    logger.info(f"  Num examples = {len(train_dataset)}")
    logger.info(f"  Num batches each epoch = {len(train_dataloader)}")
    logger.info(f"  Num Epochs = {args.num_train_epochs}")
    logger.info(f"  Instantaneous batch size per device = {args.train_batch_size}")
    logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
    logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
    logger.info(f"  Total optimization steps = {args.max_train_steps}")
    # Only show the progress bar once on each machine.
    progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
    progress_bar.set_description("Steps")
    global_step = 0

    for epoch in range(args.num_train_epochs):
        unet.train()
        if args.train_text_encoder:
            text_encoder.train()
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(unet):
                # Convert images to latent space
                latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
                latents = latents * 0.18215

                # Sample noise that we'll add to the latents
                noise = torch.randn_like(latents)
                bsz = latents.shape[0]
                # Sample a random timestep for each image
                timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)
                timesteps = timesteps.long()

                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

                # Get the text embedding for conditioning
                encoder_hidden_states = text_encoder(batch["input_ids"])[0]

                # Predict the noise residual
                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

                if args.with_prior_preservation:
                    # Chunk the noise and noise_pred into two parts and compute the loss on each part separately.
                    noise_pred, noise_pred_prior = torch.chunk(noise_pred, 2, dim=0)
                    noise, noise_prior = torch.chunk(noise, 2, dim=0)

                    # Compute instance loss
                    loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="none").mean([1, 2, 3]).mean()

                    # Compute prior loss
                    prior_loss = F.mse_loss(noise_pred_prior.float(), noise_prior.float(), reduction="mean")

                    # Add the prior loss to the instance loss.
                    loss = loss + args.prior_loss_weight * prior_loss
                else:
                    loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")

                accelerator.backward(loss)
                if accelerator.sync_gradients:
                    params_to_clip = (
                        itertools.chain(unet.parameters(), text_encoder.parameters())
                        if args.train_text_encoder
                        else unet.parameters()
                    )
                    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

            # Checks if the accelerator has performed an optimization step behind the scenes
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1

            logs = {"loss": loss.detach().item(), "lr": lr_scheduler.get_last_lr()[0]}
            progress_bar.set_postfix(**logs)
            accelerator.log(logs, step=global_step)

            if global_step >= args.max_train_steps:
                break


            if args.save_model_every_n_steps != None and (global_step % args.save_model_every_n_steps) == 0:
                save_model(accelerator, unet, text_encoder, args, global_step)

        accelerator.wait_for_everyone()

    save_model(accelerator, unet, text_encoder, args, step=None)

    accelerator.end_training()


if __name__ == "__main__":
    args = parse_args()
    main(args)

# 用于训练特定物体/人物的方法只需单一标签
export MODEL_NAME="./model"
export INSTANCE_DIR="./datasets/test2"
export OUTPUT_DIR="./new_model"
export CLASS_DIR="./datasets/class" # 用于存放模型生成的先验知识的图片文件夹请勿改动
export LOG_DIR="/root/tf-logs"
export TEST_PROMPTS_FILE="./test_prompts_object.txt"

rm -rf $CLASS_DIR/* # 如果你要训练与上次不同的特定物体/人物需要先清空该文件夹。其他时候可以注释掉这一行前面加#
rm -rf $LOG_DIR/*

accelerate launch tools/train_dreambooth.py \
  --train_text_encoder \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --mixed_precision="fp16" \
  --instance_data_dir=$INSTANCE_DIR \
  --instance_prompt="a photo of <xxx> dog" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_prompt="a photo of dog" \
  --class_data_dir=$CLASS_DIR \
  --num_class_images=200 \
  --output_dir=$OUTPUT_DIR \
  --logging_dir=$LOG_DIR \
  --center_crop \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --auto_test_model \
  --test_prompts_file=$TEST_PROMPTS_FILE \
  --test_seed=123 \
  --test_num_per_prompt=3 \
  --max_train_steps=1000 \
  --save_model_every_n_steps=500 

# 如果max_train_steps改大了请记得把save_model_every_n_steps也改大
# 不然磁盘很容易中间就满了

# 以下是核心参数介绍
# 主要的几个
# --train_text_encoder 训练文本编码器
# --mixed_precision="fp16" 混合精度训练
# - center_crop 
# 是否裁剪图片一般如果你的数据集不是正方形的话需要裁剪
# - resolution 
# 图片的分辨率一般是512使用该参数会自动缩放输入图像
# 可以配合center_crop使用达到裁剪成正方形并缩放到512*512的效果
# - instance_prompt
# 如果你希望训练的是特定的人物使用该参数
# 如 --instance_prompt="a photo of <xxx> girl"
# - class_prompt
# 如果你希望训练的是某个特定的类别使用该参数可能提升一定的训练效果
# - use_txt_as_label
# 是否读取与图片同名的txt文件作为label
# 如果你要训练的是整个大模型的图像风格那么可以使用该参数
# 该选项会忽略instance_prompt参数传入的内容
# - learning_rate
# 学习率一般是2e-6是训练中需要调整的关键参数
# 太大会导致模型不收敛太小的话训练速度会变慢
# - lr_scheduler, 可选项有constant, linear, cosine, cosine_with_restarts, cosine_with_hard_restarts
# 学习率调整策略一般是constant即不调整如果你的数据集很大可以尝试其他的但是可能会导致模型不收敛需要调整学习率
# - lr_warmup_steps如果你使用的是constant那么这个参数可以忽略
# 如果使用其他的那么这个参数可以设置为0即不使用warmup
# 也可以设置为其他的值比如1000即在前1000个step中学习率从0慢慢增加到learning_rate的值
# 一般不需要设置, 除非你的数据集很大训练收敛很慢
# - max_train_steps
# 训练的最大步数一般是1000如果你的数据集比较大那么可以适当增大该值
# - save_model_every_n_steps
# 每多少步保存一次模型方便查看中间训练的结果找出最优的模型也可以用于断点续训

# --with_prior_preservation--prior_loss_weight=1.0分别是使用先验知识保留和先验损失权重
# 如果你的数据样本比较少那么可以使用这两个参数可以提升训练效果还可以防止过拟合即生成的图片与训练的图片相似度过高

# --auto_test_model, --test_prompts_file, --test_seed, --test_num_per_prompt
# 分别是自动测试模型每save_model_every_n_steps步后、测试的文本、随机种子、每个文本测试的次数
# 测试的样本图片会保存在模型输出目录下的test文件夹中

from diffusers import StableDiffusionPipeline
import torch
from diffusers import DDIMScheduler

model_path = "./new_model"  
prompt = "a cute girl, blue eyes, brown hair"
torch.manual_seed(123123123)

pipe = StableDiffusionPipeline.from_pretrained(
        model_path, 
        torch_dtype=torch.float16,
        scheduler=DDIMScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear",
            clip_sample=False,
            set_alpha_to_one=True,
        ),
        safety_checker=None
    )

# def dummy(images, **kwargs):
#     return images, False
# pipe.safety_checker = dummy
pipe = pipe.to("cuda")
images = pipe(prompt, width=512, height=512, num_inference_steps=30, num_images_per_prompt=3).images
for i, image in enumerate(images):
    image.save(f"test-{i}.png")

Point-E单图建模增加多样性

blender修改颜色、光照、纹理

InstructPix2Pix文本增加多样性效果

参考

[1]https://imagen.research.google/

[2]https://twitter.com/ai__pub/status/1561362542487695360

[3]https://stability.ai/blog/stable-diffusion-announcement

[4]https://arxiv.org/abs/2112.10752

[5]https://dreambooth.github.io/

[6]https://twitter.com/natanielruizg/status/1563166568195821569

[7]https://natanielruiz.github.io/

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

[9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel CohenOr. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

[11] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint arXiv:2205.12952, 2022.

[12]https://techcrunch.com/2022/12/20/openai-releases-point-e-an-ai-that-generates-3d-models/?tpcc=tcplustwitter

[13]https://www.engadget.com/openai-releases-point-e-dall-e-3d-text-modeling-210007892.html?src=rss

阿里云国内75折回扣微信号：monov8

阿里云国际，腾讯云国际，低至75折。AWS 93折免费开户实名账号代冲值优惠多多微信号：monov8 飞机：@monov6

返回列表

上一篇：OpenPPL PPQ量化(2)：离线静态量化源码剖析

下一篇：【C++修炼之路】C++入门（中）—— 函数重载和引用