SAGAN: Self-Attention GANs
Introduction

In my effort to better understand the concept of self-attention, I tried dissecting one of its particular use cases within one of my current deep learning interests: Generative Adversarial Networks (GANs). As I dug into the Self-Attention GAN (or "SAGAN") research paper, while following similar implementations in PyTorch and TensorFlow in parallel, I noticed how exhausting it can get to power through the formality and the mathematically dense blocks before arriving at a clear intuition of the paper's contents. I get that formal papers are written that way for precision of language, but I do think there is a need for bite-sized versions that define the prerequisite knowledge and lay out the advantages and disadvantages candidly.
In this article, I am going to try to give a computationally efficient interpretation of SAGAN without reducing too much of the accuracy, for the "hacky" people out there who just want to get started (wow, so witty).
So, here’s how I’m going to do it:
What do I need to know?
What is it? Who made it?
What does it solve? Advantages and Disadvantages?
Possible further studies?
Source/s
What do I need to know?

Basic Machine Learning and Deep Learning concepts (Dense Layers, Activation Functions, Optimizers, Backpropagation, Normalization, etc.)
Vanilla GAN
Other GANs: Deep Convolutional GAN (DCGAN), Wasserstein GAN (WGAN)
Convolutional Neural Networks: Intuition, Limitations and Relational Inductive Biases (just think of these as assumptions)
Spectral Norms and the Power Iteration Method
Two Time-Scale Update Rule (TTUR)
Self-Attention

First and foremost, basic concepts are always necessary. Let's just leave it at that, haha. Moving on, a working understanding of the game mechanics of classical GAN training would be quite handy. In practice, most versions of GANs are now trained with convolutional layers and a non-saturating or Wasserstein loss, so learning about DCGANs and WGANs is very useful. Also, understanding that CNNs make a locality assumption is key to seeing why self-attention is useful in SAGANs (or in general). For the people who get restless without the proof (a.k.a. math nerds), it would be helpful to check out spectral norms and the power iteration method, an eigenvector approximation algorithm, beforehand; there is a small sketch of the idea right below. As for TTUR, honestly, this is just having two separate learning rates for your generator and discriminator models. Feel free to check out the paper on attention too, even though I'll only be mildly going through it.
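Since spectral normalization uses power iteration under the hood, here is a minimal sketch of the idea in PyTorch. The function name and the iteration count are my own choices for illustration, not anything from the paper:

```python
import torch

def estimate_spectral_norm(W, n_iters=20):
    """Approximate the largest singular value (the spectral norm) of a
    2-D weight matrix W using power iteration."""
    u = torch.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.t() @ u                 # pull u back through W^T
        v = v / v.norm()
        u = W @ v                     # push v forward through W
        u = u / u.norm()
    return torch.dot(u, W @ v)        # sigma ~= u^T W v

# quick sanity check against the exact value
W = torch.randn(64, 128)
print(estimate_spectral_norm(W).item(), torch.linalg.matrix_norm(W, ord=2).item())
```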
What is it? Who made it?

Essentially, SAGAN is a convolutional GAN that uses a self-attention layer/block in the generator model, applies spectral normalization to both the generator and the discriminator, and trains via the two time-scale update rule (TTUR) and the hinge version of the adversarial loss. Everything else is common GAN practice; some examples would be using the tanh function at the end of the generator model, using leaky ReLU in the discriminator, and generally just using Adam as your optimizer. This architecture was created by Han Zhang, Ian Goodfellow, Dimitris Metaxas and Augustus Odena.
If you looked through the prerequisites, this definition would be pretty straightforward.
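For reference, the hinge version of the adversarial loss looks roughly like this, written from my reading of the paper and with the class-conditioning dropped for brevity:

```latex
L_D = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\min\big(0,\,-1 + D(x)\big)\right]
      \;-\; \mathbb{E}_{z \sim p_z}\!\left[\min\big(0,\,-1 - D(G(z))\big)\right]

L_G = -\,\mathbb{E}_{z \sim p_z}\!\left[D(G(z))\right]
```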
Hinge Version of Adversarial Loss Used in the Paper

What does it solve? Advantages and Disadvantages?

To start, an attention module is something incorporated into your model so that it can use all of your input's information (global access) for the output in a way that is not too computationally expensive. Self-attention is just the specific version in which your query, key and value vectors are all derived from the same input. In the paper's self-attention figure, these are the f, g and h functions. Primarily used in NLP, it has found its way into CNNs and GANs because of the locality assumption that CNNs make. Since CNNs and previous convolution-based GANs use a small window to predict the next layer, outputs with complex geometry (e.g. dogs, full-body photos, etc.) are harder to generate compared to pictures of oceans, skies and other backgrounds. I've also read that previous GANs had a harder time generating images in multi-class situations, but I need to read up more on that. Now, self-attention makes it possible to have global access to input information, giving the generator the ability to learn from all feature locations.
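To make that idea concrete, here is a rough PyTorch sketch of a SAGAN-style self-attention block. The C//8 channel reduction on the query/key paths and the learnable gamma that scales the attention output follow the paper; everything else (names, shape handling) is my own simplification, so treat it as illustration rather than reference code:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # f, g, h from the paper's figure: 1x1 convolutions over the same input
        self.f = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # query
        self.g = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # key
        self.h = nn.Conv2d(in_channels, in_channels, kernel_size=1)       # value
        self.gamma = nn.Parameter(torch.zeros(1))  # starts at 0: no attention at first

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w                                    # number of feature locations
        q = self.f(x).view(b, -1, n)                 # B x C/8 x N
        k = self.g(x).view(b, -1, n)                 # B x C/8 x N
        v = self.h(x).view(b, c, n)                  # B x C   x N
        # attention map: every location attends to every other location
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # B x N x N
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # residual connection, with gamma learning how much attention to use
        return self.gamma * out + x
```

In practice, a block like this just sits between two convolutional layers at a chosen feature-map resolution.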
From the paper's self-attention diagram: ⊗ just means matrix multiplication. The first part shows how the previous layer is converted into three representations (query, key and value) of the same input using 1x1 convolutions.

Another thing about SAGAN is that it uses spectral normalization on both the generator and the discriminator for better conditioning. What spectral normalization does is allow fewer discriminator updates per generator update by limiting the spectral norm of the weight matrices, which constrains the Lipschitz constant of the network function. That's a mouthful, but you can just think of it as a more powerful normalization technique. Lastly, SAGANs use the two time-scale update rule to address slow-learning discriminators. Typically, the discriminator starts with a higher learning rate to avoid mode collapse.
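Both of these are simpler to use than they sound. A minimal PyTorch sketch, assuming the 1e-4/4e-4 learning rates and (0, 0.9) betas I've seen reported for the paper (treat them as starting points to tune, not fixed settings):

```python
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils import spectral_norm

# Spectral normalization: wrap a weight-bearing layer so its weight is divided
# by a power-iteration estimate of its spectral norm on every forward pass.
sn_conv = spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1))

def make_ttur_optimizers(generator: nn.Module, discriminator: nn.Module):
    """TTUR really is just two optimizers with different learning rates,
    the discriminator's being the larger of the two."""
    g_opt = optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
    d_opt = optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
    return g_opt, d_opt
```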
Possible further studies?

As of the moment, I'm personally having a difficult time generating 256x256 images, due either to computational expense or to something I don't fully understand about the capacity or nuances of the model. Has anyone tried progressively growing a SAGAN?
Thanks for reading! I hope you enjoyed! I would love to do more of these so feedback is very much welcome. :)
Source/s

Self-Attention GAN Paper
Spectral Normalization for GANs