Deep Learning
Notation
- \(\mathbf{X}\) : input tensor or token sequence
- \(\mathbf{H}\) : hidden representation
- \(\mathbf{Z}\) : latent or token embedding sequence
- \(\hat{y}\) : model prediction
- \(\mathcal{L}\) : training objective / loss
- \(\mathbf{I}\) : identity matrix
- \(\mathcal{N}(\mathbf{x}; \mu, \Sigma)\) : Gaussian distribution with mean \(\mu\) and covariance \(\Sigma\)
- \(\epsilon_\theta\) : learned noise predictor
- \(\varnothing\) : null token used for unconditional conditioning
- \(\mathbf{x}_t\) : noisy sample or state at step \(t\)
- \(\alpha_t, \beta_t, \bar{\alpha}_t\) : diffusion schedule terms
Autoregressive Modeling
\[p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})\]
\[\mathcal{L}_{\mathrm{NLL}} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\]
\[\mathcal{L}_{\mathrm{CE}} = - \sum_{i=1}^{V} y_i \log \hat{y}_i\]
\[\mathrm{PPL} = \exp \left( \frac{1}{T} \mathcal{L}_{\mathrm{NLL}} \right)\]
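With one-hot targets, the per-token cross-entropy reduces to \(-\log \hat{y}_{\text{target}}\), which is exactly one term of \(\mathcal{L}_{\mathrm{NLL}}\), so all three quantities fall out of the same log-probabilities. A minimal NumPy sketch, assuming logits of shape \((T, V)\) and integer target ids (the shapes, names, and toy data here are illustrative, not taken from any particular library):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll_and_ppl(logits, targets):
    # logits: (T, V) next-token scores; targets: (T,) integer token ids.
    logp = log_softmax(logits)                            # log p_theta(x_t | x_<t)
    nll = -logp[np.arange(len(targets)), targets].sum()   # L_NLL, summed over t
    ppl = np.exp(nll / len(targets))                      # PPL = exp(L_NLL / T)
    return nll, ppl

rng = np.random.default_rng(0)                            # toy data for illustration
T, V = 8, 50
nll, ppl = nll_and_ppl(rng.normal(size=(T, V)), rng.integers(0, V, size=T))
print(f"NLL={nll:.3f}  PPL={ppl:.3f}")
```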
Normalization
\[\mathrm{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta\]
\[\mathrm{RMSNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}\]
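Both norms act independently on each \(d\)-dimensional vector; RMSNorm drops the mean subtraction and the \(\beta\) shift and rescales by the root mean square alone. A minimal NumPy sketch of both, matching the formulas above (the function names and the \(\epsilon\) default are illustrative assumptions):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector to zero mean and unit variance, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Rescale by the root mean square only: no mean subtraction, no beta shift.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
print(rms_norm(x, gamma=np.ones(4)))
```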
Feed-Forward Network
\[\mathrm{FFN}(\mathbf{x}) = \phi(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\]
\[\mathrm{SwiGLU}(\mathbf{x}) = \left(\mathrm{SiLU}(\mathbf{x}\mathbf{W}_1) \odot \mathbf{x}\mathbf{W}_2\right)\mathbf{W}_3\]
\[\mathrm{SiLU}(x) = x \, \sigma(x)\]
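In the SwiGLU variant, \(\mathbf{W}_1\) and \(\mathbf{W}_2\) project \(d \to d_{\mathrm{ff}}\) (gate and up projections) and \(\mathbf{W}_3\) projects back down, replacing the single hidden activation of the plain FFN with an elementwise gate. A minimal NumPy sketch of both, with illustrative shapes and names (biases omitted in SwiGLU, as in the formula above):

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigma(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def ffn(x, W1, b1, W2, b2, phi=silu):
    # Two-layer position-wise MLP: phi(x W1 + b1) W2 + b2.
    return phi(x @ W1 + b1) @ W2 + b2

def swiglu(x, W1, W2, W3):
    # Gated variant: (SiLU(x W1) * (x W2)) W3, with * elementwise.
    return (silu(x @ W1) * (x @ W2)) @ W3

rng = np.random.default_rng(0)             # toy weights for illustration
d, d_ff = 8, 32
x = rng.normal(size=(d,))
out = swiglu(x,
             rng.normal(size=(d, d_ff)),   # W1: gate projection
             rng.normal(size=(d, d_ff)),   # W2: up projection
             rng.normal(size=(d_ff, d)))   # W3: down projection
print(out.shape)  # (8,)
```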
Activations
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]
\[\mathrm{ReLU}(x) = \max(0, x)\]
\[\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
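In practice softmax is evaluated after subtracting the maximum logit: the output is unchanged because softmax is invariant to adding a constant to every \(x_i\), but \(e^{x_i}\) can no longer overflow. A minimal NumPy sketch of these activations with the stable softmax (tanh is already built into NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the max leaves the output unchanged (shift invariance)
    # but keeps exp() from overflowing for large logits.
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

x = np.array([1.0, 2.0, 3.0])
print(sigmoid(x))
print(np.tanh(x))
print(relu(x))
print(softmax(x))  # sums to 1
```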