Deep Learning
Notation
- \(\mathbf{X}\) : input tensor or token sequence
- \(\mathbf{H}\) : hidden representation
- \(\mathbf{Z}\) : latent or token embedding sequence
- \(\hat{y}\) : model prediction
- \(\mathcal{L}\) : training objective / loss
- \(\mathbf{I}\) : identity matrix
- \(\mathcal{N}(\mathbf{x}; \mu, \Sigma)\) : Gaussian distribution with mean \(\mu\) and covariance \(\Sigma\)
- \(\epsilon_\theta\) : learned noise predictor
- \(\varnothing\) : null token used for unconditional conditioning
- \(\mathbf{x}_t\) : noisy sample or state at step \(t\)
- \(\alpha_t, \beta_t, \bar{\alpha}_t\) : diffusion schedule terms
Autoregressive Modeling
\[p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})\]
\[\mathcal{L}_{\mathrm{NLL}} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\]
\[\mathcal{L}_{\mathrm{CE}} = - \sum_{i=1}^{V} y_i \log \hat{y}_i\]
\[\mathrm{PPL} = \exp \left( \frac{1}{T} \mathcal{L}_{\mathrm{NLL}} \right)\]
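With one-hot targets, the per-token cross-entropy reduces to \(-\log \hat{y}_{\text{target}}\), which is exactly one term of \(\mathcal{L}_{\mathrm{NLL}}\), so all three quantities fall out of the same log-probabilities. A minimal NumPy sketch, assuming logits of shape \((T, V)\) and integer target ids (the shapes, names, and toy data here are illustrative, not taken from any particular library):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll_and_ppl(logits, targets):
    # logits: (T, V) next-token scores; targets: (T,) integer token ids.
    logp = log_softmax(logits)                            # log p_theta(x_t | x_<t)
    nll = -logp[np.arange(len(targets)), targets].sum()   # L_NLL, summed over t
    ppl = np.exp(nll / len(targets))                      # PPL = exp(L_NLL / T)
    return nll, ppl

rng = np.random.default_rng(0)                            # toy data for illustration
T, V = 8, 50
nll, ppl = nll_and_ppl(rng.normal(size=(T, V)), rng.integers(0, V, size=T))
print(f"NLL={nll:.3f}  PPL={ppl:.3f}")
```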
Normalization
\[\mathrm{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
\[\mathrm{RMSNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}\]
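Both norms act independently on each \(d\)-dimensional vector; RMSNorm drops the mean subtraction and the \(\beta\) shift and rescales by the root mean square alone. A minimal NumPy sketch of both, matching the formulas above (the function names and the \(\epsilon\) default are illustrative assumptions):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector to zero mean and unit variance, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Rescale by the root mean square only: no mean subtraction, no beta shift.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
print(rms_norm(x, gamma=np.ones(4)))
```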
Feed-Forward Network
\[\mathrm{FFN}(\mathbf{x}) = \phi(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\]
\[\mathrm{SwiGLU}(\mathbf{x}) = \left(\mathrm{SiLU}(\mathbf{x}\mathbf{W}_1) \odot \mathbf{x}\mathbf{W}_2\right)\mathbf{W}_3\]
\[\mathrm{SiLU}(x) = x \, \sigma(x)\]
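In the SwiGLU variant, \(\mathbf{W}_1\) and \(\mathbf{W}_2\) project \(d \to d_{\mathrm{ff}}\) (gate and up projections) and \(\mathbf{W}_3\) projects back down, replacing the single hidden activation of the plain FFN with an elementwise gate. A minimal NumPy sketch of both, with illustrative shapes and names (biases omitted in SwiGLU, as in the formula above):

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigma(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def ffn(x, W1, b1, W2, b2, phi=silu):
    # Two-layer position-wise MLP: phi(x W1 + b1) W2 + b2.
    return phi(x @ W1 + b1) @ W2 + b2

def swiglu(x, W1, W2, W3):
    # Gated variant: (SiLU(x W1) * (x W2)) W3, with * elementwise.
    return (silu(x @ W1) * (x @ W2)) @ W3

rng = np.random.default_rng(0)             # toy weights for illustration
d, d_ff = 8, 32
x = rng.normal(size=(d,))
out = swiglu(x,
             rng.normal(size=(d, d_ff)),   # W1: gate projection
             rng.normal(size=(d, d_ff)),   # W2: up projection
             rng.normal(size=(d_ff, d)))   # W3: down projection
print(out.shape)  # (8,)
```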
Activations
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]
\[\mathrm{ReLU}(x) = \max(0, x)\]
\[\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
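In practice softmax is evaluated after subtracting the maximum logit: the output is unchanged because softmax is invariant to adding a constant to every \(x_i\), but \(e^{x_i}\) can no longer overflow. A minimal NumPy sketch of these activations with the stable softmax (tanh is already built into NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the max leaves the output unchanged (shift invariance)
    # but keeps exp() from overflowing for large logits.
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

x = np.array([1.0, 2.0, 3.0])
print(sigmoid(x))
print(np.tanh(x))
print(relu(x))
print(softmax(x))  # sums to 1
```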