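AI Painting in Plain Words: Technical Principles and Algorithm Optimization

Introduction

As the title suggests, this post is mainly a review of my research work over the past while.

Frankly, not many people manage to explain the technical details clearly in plain language.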

How to Learn, and Related Resources

But what actually lies in front of you is a stinking ditch.

Hardware: 32 x 8 x A100 GPUs

Optimizer: AdamW
Gradient Accumulations: 2
Batch: 32 x 8 x 2 x 4 = 2048
Learning rate: warmup to 0.0001 for 10,000 steps and then kept constant
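
The batch arithmetic and the learning-rate rule above can be checked with a few lines of Python. This is only a sketch: reading the factors as nodes x GPUs per node x accumulation steps x per-GPU batch is an assumption, and so is the linear shape of the warmup, since the listing only says "warmup to 0.0001 for 10,000 steps and then kept constant".

```python
def effective_batch_size(nodes, gpus_per_node, grad_accum, per_gpu_batch):
    """Samples contributing to one optimizer update (assumed factor meaning)."""
    return nodes * gpus_per_node * grad_accum * per_gpu_batch


def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Assumed linear warmup to peak_lr, then constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


print(effective_batch_size(32, 8, 2, 4))  # 2048
for step in (0, 4_999, 9_999, 50_000):
    print(step, learning_rate(step))
```

The function names `effective_batch_size` and `learning_rate` are made up for this illustration and do not come from any training script.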

Hardware Type: A100 PCIe 40GB
Hours used: 150000
Cloud Provider: AWS
Compute Region: US-east
Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 11250 kg CO2 eq.

The model was trained on Amazon's cloud computing service using 256 NVIDIA A100 GPUs, consuming 150,000 GPU hours in total at a cost of USD 600,000.
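
To put those figures in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above (150,000 GPU hours, 256 GPUs, USD 600,000, 11,250 kg CO2 eq) and assumes, for simplicity, that all 256 GPUs were busy for the whole run.

```python
gpu_hours = 150_000
num_gpus = 256
total_cost_usd = 600_000
carbon_kg = 11_250

cost_per_gpu_hour = total_cost_usd / gpu_hours   # USD per GPU hour
wall_clock_days = gpu_hours / num_gpus / 24      # if every GPU ran the whole time
carbon_per_gpu_hour = carbon_kg / gpu_hours      # kg CO2 eq per GPU hour

print(cost_per_gpu_hour)          # 4.0
print(round(wall_clock_days, 1))  # 24.4
print(carbon_per_gpu_hour)        # 0.075
```

In other words, roughly USD 4 per GPU hour and a little over three weeks of wall-clock time on the full cluster.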

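These figures are a warning to turn back, but the results are so startlingly good that everyone rushed in like moths to a flame, and the whole world has started fighting over it.

As the field exploded in popularity, the available resources have grown explosively too.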

Curated resources:

Third-party:

huggingface/diffusers: 🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch (github.com)

Official:

Stability-AI/stablediffusion: High-Resolution Image Synthesis with Latent Diffusion Models (github.com)

Denoising Diffusion Implicit Models (keras.io)

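Basic Principles of Stable Diffusion

This is only a plain-language overview, meant to help you get started quickly.

To understand the principles, reading the code alone is nowhere near enough.

As for the official implementation, that repository is a real mess, tangled and disorganized.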

keras-cv/keras_cv/models/stable_diffusion

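1. Text encoder:

Encodes the input text into features, which are used to guide the diffusion model's content generation.

keras-cv/image_encoder.py

3. Latent decoder:

Decodes the latent features into an image.

keras-cv/diffusion_model.py

In plain terms, the encoder compresses, the decoder decompresses, the diffusion model edits the compressed information, and the text steers the direction of the edit.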

keras-cv/clip_tokenizer.py

In short, it converts text into numbers.
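
As a rough illustration of "text into numbers", here is a toy word-level tokenizer. It is not the tokenizer in clip_tokenizer.py (which works on sub-word units with a fixed, pre-trained vocabulary); the vocabulary, the special tokens and the `max_length` of 8 are all assumptions made for this sketch.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct word seen in the corpus."""
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length=8):
    """Turn text into a fixed-length list of ids: start, words, end, padding."""
    ids = [vocab["<start>"]]
    ids += [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[: max_length - 1] + [vocab["<end>"]]
    ids += [vocab["<pad>"]] * (max_length - len(ids))
    return ids


vocab = build_vocab(["a photo of a cat", "a painting of a dog"])
print(encode("a photo of a dog", vocab))  # [1, 4, 5, 6, 4, 9, 2, 0]
```

The fixed output length matters because the text encoder expects every prompt to have the same shape, whatever its wording.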

Noise schedule: mainly used to compute the ratio of noise added and removed during training and synthesis.
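
A minimal sketch of what such a schedule computes, assuming a DDPM-style linear beta schedule (1e-4 to 0.02 over 1000 steps). These values and the function names are assumptions for illustration; the schedule actually used by a given Stable Diffusion implementation may differ.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variance, growing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t]: fraction of the original signal variance left at step t."""
    alpha_bar, out = 1.0, []
    for beta in betas:
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Forward diffusion: jump from the clean signal x0 straight to step t."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise


alpha_bars = cumulative_alphas(linear_beta_schedule())
for t in (0, 250, 500, 999):
    print(t, round(math.sqrt(alpha_bars[t]), 4), round(math.sqrt(1 - alpha_bars[t]), 4))
xt, noise = add_noise([1.0, -0.5, 0.25], 500, alpha_bars, random.Random(0))
```

The two printed columns are the signal and noise ratios: early steps are almost pure signal, late steps almost pure noise. During training the model learns to predict `noise` from `xt`.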

keras-cv/stable_diffusion.py

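Text-to-image:

Image-to-image:

These are the two mainstream approaches. There is also content inpainting with a mask, as well as generation assisted by other modules, for example generating pictures of pretty girls.

You will find that the techniques involved are far from simple.

In one sentence: this is an algorithm for signal compression and decompression.

In both the forward and the reverse process, noise can be injected or altered to guide the reconstruction of the signal.

We know that:

A bitmap (bmp) file can be compressed into a jpg using the discrete cosine transform (together with quantization and entropy coding).

You guessed right, it can.

Back to the main topic: what do we gain from modeling on noise? Once we do, the story becomes much more interesting.

Stable Diffusion is exactly such a technical approach.

At generation time, diffusion_model performs the reverse-diffusion editing and decompression,

I know you have a question now: can information be inserted at any point during generation, or can forward and reverse diffusion be mixed?

Image-to-image inserts a preset image as a node in the reverse diffusion process, mixes it with noise, and then synthesizes the result under the guidance of the text semantics.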
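
The "insert an image node and mix it with noise" step can be sketched as follows. This assumes the common convention of a `strength` parameter in (0, 1] that decides how far along the noise schedule the image is pushed before denoising starts; the parameter name, the linear schedule and the helper functions are all assumptions for illustration, and the denoising model itself is left out.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for an assumed linear schedule."""
    alpha_bar, out = 1.0, []
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out


def img2img_start(image_latent, strength, alpha_bars, rng):
    """Noise an encoded image up to the step implied by strength.

    Returns the noisy latent and the timesteps still to be denoised.
    """
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    start_t = max(int(len(alpha_bars) * strength) - 1, 0)
    signal = math.sqrt(alpha_bars[start_t])
    noise_scale = math.sqrt(1.0 - alpha_bars[start_t])
    noisy = [signal * x + noise_scale * rng.gauss(0.0, 1.0) for x in image_latent]
    remaining_steps = list(range(start_t, -1, -1))
    return noisy, remaining_steps


alpha_bars = make_alpha_bars()
latent = [0.5, -1.0, 0.25, 2.0]  # stands in for an encoded image
noisy, steps = img2img_start(latent, 0.5, alpha_bars, random.Random(0))
print(len(steps), steps[0], steps[-1])  # 500 499 0
```

A low strength keeps most of the input image and leaves few steps for the text to reshape it; a strength of 1.0 starts from nearly pure noise and behaves much like text-to-image.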

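Suppose all the world's computers were connected and diffusion were split up divide-and-conquer style: then any machine belonging to anyone, anywhere in the world, could compute the diffusion edit for one node on your behalf.

Just one question: is it too late to get on board?

Key Algorithms and Related Optimizations

The answer is: cross-modal attention.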
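
A minimal sketch of cross-attention in plain Python: the queries come from one modality (image latent tokens) and the keys and values from another (text tokens). The learned projection matrices that a real layer applies to queries, keys and values are omitted here, and the tiny vectors are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (image token) takes a weighted average of the text values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


image_tokens = [[1.0, 0.0], [0.0, 1.0]]              # 2 latent positions
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 text tokens
text_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
result = cross_attention(image_tokens, text_keys, text_values)
print(len(result), len(result[0]))  # 2 2
```

The output has one row per image token, so the text can influence every position of the latent without the two sequences needing the same length.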

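[1706.03762] Attention Is All You Need (arxiv.org)

The Transformer architecture showed that modeling data across different dimensions is feasible.

A GitHub account worth following:

lucidrains is extremely diligent and prolific; his hand can be seen in almost every third-party Transformer implementation, stable_diffusion included.

The big companies have also been hard at work on algorithm optimizations for stable_diffusion.

Intel

Qualcomm

Apple

Google

Note that the optimization discussed here refers to optimizing inference, not training.

Rather, it is a second round of training, fine-tuning on top of stable_diffusion.

Training optimization cannot be covered in a short space, so I will focus on optimizing inference.

Optimizations · AUTOMATIC1111/stable-diffusion-webui Wiki (github.com)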

commandline argument	explanation
`--opt-sdp-attention`	Faster speeds than using xformers, only available for users who manually install torch 2.0 to their venv. (non-deterministic)
`--opt-sdp-no-mem-attention`	Faster speeds than using xformers, only available for users who manually install torch 2.0 to their venv. (deterministic, slightly slower than `--opt-sdp-attention`)
`--xformers`	Use xformers library. Great improvement to memory consumption and speed. Will only be enabled on small subset of configuration because that's what we have binaries for.
`--force-enable-xformers`	Enables xformers above regardless of whether the program thinks you can run it or not. Do not report bugs you get running this.
`--opt-split-attention`	Cross attention layer optimization significantly reducing memory use for almost no cost (some report improved performance with it). Black magic. On by default for `torch.cuda`, which includes both NVidia and AMD cards.
`--disable-opt-split-attention`	Disables the optimization above.
`--opt-sub-quad-attention`	Sub-quadratic attention, a memory efficient Cross Attention layer optimization that can significantly reduce required memory, sometimes at a slight performance cost. Recommended if getting poor performance or failed generations with a hardware/software configuration that xformers doesn't work for. On macOS, this will also allow for generation of larger images.
`--opt-split-attention-v1`	Uses an older version of the optimization above that is not as memory hungry (it will use less VRAM, but will be more limiting in the maximum size of pictures you can make).
`--medvram`	Makes the Stable Diffusion model consume less VRAM by splitting it into three parts - cond (for transforming text into numerical representation), first_stage (for converting a picture into latent space and back), and unet (for actual denoising of latent space) and making it so that only one is in VRAM at all times, sending others to CPU RAM. Lowers performance, but only by a bit - except if live previews are enabled.
`--lowvram`	An even more thorough optimization of the above, splitting unet into many modules, and only one module is kept in VRAM. Devastating for performance.
`*do-not-batch-cond-uncond`	Prevents batching of positive and negative prompts during sampling, which essentially lets you run at 0.5 batch size, saving a lot of memory. Decreases performance. Not a command line option, but an optimization implicitly enabled by using `--medvram` or `--lowvram`.
`--always-batch-cond-uncond`	Disables the optimization above. Only makes sense together with `--medvram` or `--lowvram`
`--opt-channelslast`	Changes torch memory type for stable diffusion to channels last. Effects not closely studied.
`--upcast-sampling`	For Nvidia and AMD cards normally forced to run with `--no-half`, should improve generation speed.
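
The `*do-not-batch-cond-uncond` row is easier to follow with a toy example. The sketch below assumes the positive and negative prompts are combined by classifier-free guidance, and uses a made-up `toy_denoiser` in place of the real U-Net; it only shows that running the two prompts in one batch or one after the other yields the same guided result, while the sequential version holds half as many items at a time.

```python
def toy_denoiser(batch):
    """Hypothetical stand-in for the U-Net: one (latent, context) pair per item."""
    return [[0.9 * x + 0.1 * context for x in latent] for latent, context in batch]


def guide(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: push the prediction away from the negative prompt."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def step_batched(latent, cond, uncond, scale=7.5):
    """One model call on a batch of two: faster, but both items are in memory."""
    eps_uncond, eps_cond = toy_denoiser([(latent, uncond), (latent, cond)])
    return guide(eps_uncond, eps_cond, scale)


def step_sequential(latent, cond, uncond, scale=7.5):
    """Two model calls on a batch of one each: slower, lower peak memory."""
    eps_uncond = toy_denoiser([(latent, uncond)])[0]
    eps_cond = toy_denoiser([(latent, cond)])[0]
    return guide(eps_uncond, eps_cond, scale)


latent = [0.2, -0.4, 1.0]
print(step_batched(latent, cond=1.0, uncond=-1.0) == step_sequential(latent, cond=1.0, uncond=-1.0))  # True
```

Here the context is a single number only to keep the example short; in the real model it is the encoded prompt.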

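Beyond reducing VRAM usage, there are several other routes to optimization.

For example:

2. Precision quantization: use fp16 or lower precision to compute an approximate solution

3. Fine-tuning and distillation: distill certain time-consuming computations in the model

4. Optimizing the diffusion sampling algorithm: use the mathematical priors of diffusion to speed up solving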
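
To get a feel for what "an approximate solution at fp16" means in item 2, the sketch below rounds a few made-up weight values to half precision using Python's standard `struct` module (format code `e`) and prints the relative error. It says nothing about how a particular framework performs the conversion; it only shows the size of the rounding error involved.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, -1.2345678, 3.14159265, 1e-3]  # illustrative values only
for w in weights:
    q = to_fp16(w)
    rel_error = abs(q - w) / abs(w)
    print(w, "->", q, "relative error", f"{rel_error:.2e}")
```

Half precision keeps about three significant decimal digits and halves the memory per value compared with fp32, which is why it is usually the first optimization people try. Values beyond the fp16 range (about 65504) cannot be represented, and `struct.pack` raises `OverflowError` for them.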

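The xformers entry in the table above is an open-source solution that optimizes memory use and computation for Transformers.

For VRAM optimization, the following two papers are the main reading:

[2205.14135v2] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (arxiv.org)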
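
The core trick that lets attention be computed exactly without ever holding the full score matrix is a running ("online") softmax over blocks of keys. The sketch below shows only that idea for a single query in plain Python; it is not the FlashAttention GPU kernel, which also tiles the queries and is designed around on-chip memory. The block size and function names are assumptions for illustration.

```python
import math


def attention_full(q, keys, values):
    """Reference: compute every score first, then softmax, then average."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, but only one block of scores is alive at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier blocks
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.0, 0.0], [0.2, 0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_full(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both lines print the same vector up to floating-point rounding, which is the sense in which this family of optimizations is "exact": memory drops, the result does not change.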

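As an aside, many WeChat official-account articles are padded and exaggerated; read them for a laugh and do not take them too seriously.

Besides the optimizations mentioned above there are plenty of other ideas, but some of them do not generalize well or have specific limitations.

Odds and Ends of Deployment

This mainly involves model conversion, precision handling, and troubleshooting.

This too cannot be covered in just a few words.

Still, a lot of work remains before images can be generated within seconds on an ordinary phone; the road ahead is long.

Suggestions for Development and Research

The training cost will not be low, though, and there is plenty of work to do,

The original stable_diffusion architecture is highly redundant and rather brute-force, a case of raw effort producing miracles.

I hope more friends will join in, to embrace the trend and the future together.

Email: gaozhihan@vip.qq.com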
