Skip to content
Miko-pedia Miko-pedia
← マップに戻る
機械学習 · AIのノート · 学部 · #ml

Machine Learning

A concise guide to machine learning.

· 20 分

このノートの改善点や、新しく欲しいノートを提案できます。

PDFを見る

Overview. This complete note surveys the foundations and practice of machine learning at the undergraduate level. It moves from the conceptual framing of learning problems (data, loss, generalization) through mathematical foundations, classical methods, deep learning, generative models, and reinforcement learning, concluding with practical deployment and ethics. Each section contains definitions, algorithms, key equations, and review guidance. The note is designed as a comprehensive review resource: study it holistically or jump to individual sections for targeted revision.

The Questions of ML: Data, Loss, Generalization, and Evaluation

Core ideas

Machine learning (ML) constructs algorithms that improve on task TT, with respect to performance metric PP, based on experience EE. In statistical ML, this is formalized as selecting a function from a hypothesis class to minimize expected error.

Problem Setup: [nosep]

  • Data: Drawn from an unknown joint distribution D\mathcal{D} over inputs xX\mathbf{x} \in \mathcal{X} and targets yYy \in \mathcal{Y}. A training set S={(x(i),y(i))}i=1nS = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n is sampled i.i.d.\ from D\mathcal{D}.
  • **Hypothesis Class (\mathcal{H**)}: The set of all possible models fθ:XYf_\theta : \mathcal{X} \to \mathcal{Y} the learning algorithm can consider (e.g., all linear functions, or neural networks of a fixed architecture).
  • Loss Function: (f(x),y)R0\ell(f(\mathbf{x}), y) \in \mathbb{R}_{\ge 0} measures the cost of a prediction.

Risk and Generalization: [nosep]

  • Expected Risk: The true objective, R(f)=E(x,y)D[(f(x),y)]R(f) = \mathbb{E}_{(\mathbf{x},y)\sim \mathcal{D}}[\ell(f(\mathbf{x}), y)]. Since D\mathcal{D} is unknown, we cannot directly minimize R(f)R(f).
  • Empirical Risk Minimization (ERM): We approximate R(f)R(f) using the training data: R^S(f)=1ni=1n(f(x(i)),y(i))\hat{R}_S(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}^{(i)}), y^{(i)}). ERM selects f^=argminfHR^S(f)\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{R}_S(f).
  • Generalization Gap: The difference R(f)R^S(f)R(f) - \hat{R}_S(f). A large gap indicates overfitting (low empirical risk but high expected risk, often due to high capacity in H\mathcal{H}). Conversely, underfitting occurs if H\mathcal{H} is too restrictive, leading to high training and test errors.

Learning paradigms: [nosep]

  • Supervised: Learn f:XYf: \mathcal{X} \to \mathcal{Y} given fully labeled pairs.
  • Unsupervised: Discover structure (clusters, manifolds, density) from unlabeled data {x(i)}i=1n\{\mathbf{x}^{(i)}\}_{i=1}^n.
  • Self-supervised: Formulate supervised tasks from unlabeled data (e.g., predicting masked words or geometric transformations).
  • Reinforcement learning (RL): Learn a policy π(as)\pi(\mathbf{a}|\mathbf{s}) to maximize cumulative rewards in sequential environments.

Evaluation Metrics: Held-out test sets estimate R(f)R(f). Common empirical metrics include: Accuracy, Precision/Recall, and F1 for classification; Mean Squared Error (MSE) and Mean Absolute Error (MAE) for regression; and Area Under the ROC Curve (AUC-ROC) for binary confidence scores.

For review, be able to: define task, performance, and experience; contrast expected and empirical risk; explain the ERM principle and the role of the hypothesis class H\mathcal{H}; and classify tasks into paradigms.

Section summary. ML formalizes learning as optimization: we seek a hypothesis fHf \in \mathcal{H} that minimizes empirical risk on training data while generalizing to the true distribution. Controlling the complexity of H\mathcal{H} to balance underfitting and overfitting is the central challenge.

Mathematical Foundations: Linear Algebra, Calculus, Probability, and Optimization

Core ideas

Linear algebra provides the language for representing data and models. [nosep]

  • A vector xRd\mathbf{x} \in \mathbb{R}^d is an ordered dd-tuple; the inner product wx=j=1dwjxj\mathbf{w}^\top\mathbf{x} = \sum_{j=1}^d w_j x_j measures similarity.
  • A matrix XRn×d\mathbf{X} \in \mathbb{R}^{n \times d} represents nn data points each with dd features.
  • The transpose (X)ij=Xji(\mathbf{X}^\top)_{ij} = \mathbf{X}_{ji}; the identity Id\mathbf{I}_d; the inverse A1\mathbf{A}^{-1} satisfies A1A=I\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}.
  • Eigendecomposition: Av=λv\mathbf{A}\mathbf{v} = \lambda \mathbf{v} for square A\mathbf{A}. Symmetric positive-definite matrices have orthogonal eigenvectors and positive eigenvalues.
  • Singular value decomposition (SVD): A=UΣV\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top for any ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}, where U,V\mathbf{U},\mathbf{V} are orthogonal and Σ\boldsymbol{\Sigma} is diagonal with singular values σ1σmin(m,n)0\sigma_1 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0.
  • Norms: L2L^2 norm x2=jxj2\|\mathbf{x}\|_2 = \sqrt{\sum_j x_j^2}; L1L^1 norm x1=jxj\|\mathbf{x}\|_1 = \sum_j |x_j|; Frobenius norm AF=ijAij2\|\mathbf{A}\|_F = \sqrt{\sum_{ij} A_{ij}^2}.

Calculus enables optimization. [nosep]

  • The gradient wf(w)\nabla_\mathbf{w} f(\mathbf{w}) is the vector of partial derivatives; it points in the direction of steepest ascent.
  • The chain rule: if z=g(f(w))z = g(f(\mathbf{w})), then wz=dgdfwf\nabla_\mathbf{w} z = \frac{dg}{df} \cdot \nabla_\mathbf{w} f.
  • Hessian Hij=2fwiwj\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial w_i \partial w_j} encodes curvature; positive-definite Hessian     \implies local minimum.
  • Taylor expansion: f(w+δ)f(w)+f(w)δ+12δHδf(\mathbf{w}+\boldsymbol{\delta}) \approx f(\mathbf{w}) + \nabla f(\mathbf{w})^\top\boldsymbol{\delta} + \frac12 \boldsymbol{\delta}^\top \mathbf{H} \boldsymbol{\delta}.

Probability quantifies uncertainty. [nosep]

  • Random variable XX with distribution P(X)P(X); expectation E[X]=xxP(x)\mathbb{E}[X] = \sum_x x P(x) (discrete) or xp(x)dx\int x p(x)\,dx (continuous).
  • Conditional probability P(YX)=P(X,Y)/P(X)P(Y|X) = P(X,Y)/P(X); Bayes’ rule: P(YX)=P(XY)P(Y)/P(X)P(Y|X) = P(X|Y)P(Y)/P(X).
  • Variance Var(X)=E[(XE[X])2]\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]; covariance Cov(X,Y)=E[(XE[X])(YE[Y])]\text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])].
  • Common distributions: Bernoulli(μ)(\mu), Binomial(n,μ)(n,\mu), Gaussian N(μ,σ2)\mathcal{N}(\mu,\sigma^2), Multinomial, Exponential.
  • Law of large numbers: sample mean converges to expectation. Central limit theorem: sum of i.i.d.\ variables is approximately Gaussian.

Optimization finds parameters minimizing a loss. [nosep]

  • Convex functions: f(λw1+(1λ)w2)λf(w1)+(1λ)f(w2)f(\lambda\mathbf{w}_1+(1-\lambda)\mathbf{w}_2) \le \lambda f(\mathbf{w}_1) + (1-\lambda) f(\mathbf{w}_2). Convex problems have no local minima other than the global minimum.
  • Gradient descent: wt+1=wtηf(wt)\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\nabla f(\mathbf{w}_t) where η\eta is the learning rate.
  • Stochastic gradient descent (SGD): uses a minibatch estimate of the gradient for efficiency.
  • First-order methods (SGD, momentum, Adam) scale to high dimensions; second-order methods (Newton) use curvature but are costly.

For review, be able to: compute gradients via the chain rule; apply Bayes’ rule; derive the normal equations; explain SVD; implement gradient descent; and recognize convex functions.

Section summary. ML rests on linear algebra (representing data and models), calculus (optimizing via gradients), probability (modeling noise and uncertainty), and optimization (finding parameters). The normal equations solve linear regression analytically; for most models we use iterative gradient-based methods.

Linear Models: Regression, Classification, and Regularization

Core ideas

Linear regression models the target as a continuous function f(x)=wx+bf(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b. Under squared error loss, the empirical risk is:

L(w,b)=12ni=1n(wx(i)+by(i))2.\mathcal{L}(\mathbf{w}, b) = \frac{1}{2n}\sum_{i=1}^n (\mathbf{w}^\top\mathbf{x}^{(i)} + b - y^{(i)})^2.

By appending 11 to each x(i)\mathbf{x}^{(i)} to absorb bb, the closed-form normal equations yield w=(XX)1Xy\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. This requires XX\mathbf{X}^\top\mathbf{X} to be invertible (full column rank).

Logistic regression is for binary classification, modeling the probability of the positive class using the logistic sigmoid function σ(z)=1/(1+ez)\sigma(z) = 1/(1+e^{-z}):

P(y=1x)=y^=σ(wx+b).P(y=1|\mathbf{x}) = \hat{y} = \sigma(\mathbf{w}^\top\mathbf{x} + b).

Log-odds interpretation: It models the log-odds as a linear function: log(y^1y^)=wx+b\log(\frac{\hat{y}}{1-\hat{y}}) = \mathbf{w}^\top\mathbf{x} + b. Optimized via gradient descent or Newton’s method using the binary cross-entropy (log-loss):

L(w)=1ni=1n[y(i)logy^(i)+(1y(i))log(1y^(i))].\mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \bigl[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\bigr].

Softmax regression (multinomial logistic regression) generalizes to KK classes:

P(y=kx)=y^k=exp(wkx)j=1Kexp(wjx).P(y=k|\mathbf{x}) = \hat{y}_k = \frac{\exp(\mathbf{w}_k^\top\mathbf{x})}{\sum_{j=1}^K \exp(\mathbf{w}_j^\top\mathbf{x})}.

Loss is the categorical cross-entropy: L(W)=1ni=1nk=1K1[y(i)=k]logy^k(i)\mathcal{L}(\mathbf{W}) = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}[y^{(i)}=k] \log\hat{y}^{(i)}_k.

Generalized Linear Models (GLMs) unify these approaches by assuming the target yy follows a distribution from the Exponential Family (e.g., Gaussian for linear regression, Bernoulli for logistic, Poisson for counts) and modeling the natural parameter as a linear combination of features η=wx\eta = \mathbf{w}^\top\mathbf{x}.

Regularization prevents overfitting by penalizing large weights: [nosep]

  • L2L^2 (Ridge): Penalty λw22\lambda\|\mathbf{w}\|_2^2. Closed-form: w=(XX+λI)1Xy\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}. Shrinks all weights.
  • L1L^1 (Lasso): Penalty λw1\lambda\|\mathbf{w}\|_1. Induces sparsity (some weights become exactly zero), acting as feature selection.
  • Elastic Net: Combines L1L^1 and L2L^2 penalties αw1+(1α)w22/2\alpha\|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2/2.

Assumptions: Ordinary least squares assumes E[yx]\mathbb{E}[y|\mathbf{x}] is linear, errors are i.i.d.\ homoscedastic (constant variance) and normally distributed, and no perfect multicollinearity.

For review, be able to: derive the normal equations; explain the log-odds interpretation of logistic regression; formulate linear and logistic regression as GLMs; compare L1L^1 vs L2L^2 regularization geometrically; and diagnose multicollinearity.

Section summary. Linear models are the simplest parametric models: regression via least squares, classification via logistic/softmax with cross-entropy. They can be unified under GLMs. Regularization controls overfitting by penalizing weight magnitudes. These models are strictly convex and serve as fundamental baselines.

Evaluation: Cross-Validation, Overfitting, and Bias-Variance Tradeoff

Core ideas

Overfitting occurs when a model learns noise in the training data, achieving low training error but high test error. It happens with: [nosep]

  • Too many parameters relative to sample size (dnd \gg n).
  • Training too long (over-optimization).
  • No regularization or too much capacity.

Underfitting occurs when the model is too simple to capture the underlying pattern.

Bias—variance decomposition (for squared error):

E[(f^(x)y)2]=(E[f^(x)]f(x))2Bias2+E[(f^(x)E[f^(x)])2]Variance+E[(yf(x))2]Irreducible error.\mathbb{E}[(\hat{f}(\mathbf{x}) - y)^2] = \underbrace{(\mathbb{E}[\hat{f}(\mathbf{x})] - f^*(\mathbf{x}))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2]}_{\text{Variance}} + \underbrace{\mathbb{E}[(y - f^*(\mathbf{x}))^2]}_{\text{Irreducible error}}.

Bias measures how much the model’s average prediction deviates from the truth. Variance measures how much predictions fluctuate across different training sets. Simple models have high bias, low variance; complex models have low bias, high variance. The tradeoff guides model selection.

Cross-validation estimates generalization without wasting data: [nosep]

  • kk-fold CV: split data into kk folds, train on k1k-1, test on the held-out fold, repeat kk times, average metrics. Typical k=5k = 5 or 1010.
  • Leave-one-out CV (LOOCV): k=nk = n, expensive but nearly unbiased.
  • Stratified CV: preserves class proportions in each fold.
  • Train/validation/test split: hold out a test set at the very end, use validation set(s) for model selection.

Hyperparameter tuning: model capacity (degree of polynomial, kk in kk-NN, λ\lambda in ridge), learning rate, network architecture. Use validation performance (not test performance) for selection.

Learning curves: plot training and validation error vs.\ dataset size or training epochs. Gaps indicate overfitting; convergence to high error indicates underfitting.

Diagnostics: [nosep]

  • Training error \ll validation error     \implies overfitting (increase regularization, get more data, reduce capacity).
  • Both errors high     \implies underfitting (increase capacity, reduce regularization, add features).

For review, be able to: derive the bias—variance decomposition; explain why cross-validation is needed; choose kk in kk-fold CV; interpret learning curves; and design a model selection pipeline.

Section summary. Generalization is the central concern of ML. The bias—variance tradeoff explains why we need regularized, capacity-controlled models. Cross-validation provides a reliable estimate of test performance for model selection. Always keep a separate test set for final evaluation.

Classical ML: Trees, SVMs, PCA, Clustering, and Ensembles

Core ideas

Decision trees recursively partition the feature space into hyper-rectangles: [nosep]

  • Splitting criterion: At each node, select a feature jj and threshold tt to maximize information gain (decrease in impurity): IG=H(parent)vDvDH(Dv)IG = H(\text{parent}) - \sum_{v} \frac{|D_v|}{|D|} H(D_v).
  • Impurity measures (HH): Entropy kpklogpk-\sum_k p_k \log p_k or Gini impurity 1kpk21 - \sum_k p_k^2 for classification; Mean Squared Error (MSE) for regression.
  • Regularization: Trees easily overfit. Mitigated by setting max depth, minimum samples per leaf, or cost-complexity pruning.

Support vector machines (SVMs) find the maximum-margin separating hyperplane. [nosep]

  • Primal formulation: minw,b12w22+Ci=1nξi\min_{\mathbf{w},b} \frac12\|\mathbf{w}\|_2^2 + C \sum_{i=1}^n \xi_i subject to y(i)(wx(i)+b)1ξiy^{(i)}(\mathbf{w}^\top\mathbf{x}^{(i)} + b) \ge 1 - \xi_i, where ξi0\xi_i \ge 0 are slack variables for non-separable data (soft-margin). CC controls the margin vs.\ misclassification penalty.
  • Dual formulation: maxαiαi12i,jαiαjy(i)y(j)x(i)x(j)\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac12 \sum_{i,j} \alpha_i\alpha_j y^{(i)}y^{(j)}\mathbf{x}^{(i)\top}\mathbf{x}^{(j)}, with 0αiC0 \le \alpha_i \le C. The solution depends only on support vectors (points where αi>0\alpha_i > 0).
  • Kernel trick: By the Representer Theorem, the optimal weights lie in the span of the data: w=iαiy(i)ϕ(x(i))\mathbf{w} = \sum_i \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)}). We can replace inner products with a kernel function k(x,x)=ϕ(x),ϕ(x)k(\mathbf{x}, \mathbf{x}') = \langle\phi(\mathbf{x}),\phi(\mathbf{x}')\rangle. Common kernels: Polynomial (xx+c)d(\mathbf{x}^\top\mathbf{x}' + c)^d, RBF exp(γxx22)\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|_2^2).

PCA (principal component analysis) finds an orthogonal projection of data onto a lower dimensional linear space that maximizes variance. [nosep]

  • Center the data: Xc=Xμ\mathbf{X}_c = \mathbf{X} - \boldsymbol{\mu}.
  • Compute SVD: Xc=UΣV\mathbf{X}_c = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top; columns of V\mathbf{V} are the principal components (eigenvectors of the covariance matrix XcXc\mathbf{X}_c^\top\mathbf{X}_c).
  • Projection: Z=XcVk\mathbf{Z} = \mathbf{X}_c\mathbf{V}_k maps into Rk\mathbb{R}^k.
  • Often used as an unsupervised preprocessing step to reduce dimensionality and mitigate the curse of dimensionality.

Clustering partitions data into distinct groups: [nosep]

  • kk-means: Minimizes within-cluster variance i=1nk=1K1[zi=k]x(i)μk22\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}[z_i = k] \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_k\|_2^2. Alternates between assigning points to the nearest centroid and recomputing centroids (Lloyd’s algorithm). Highly sensitive to initialization; use kk-means++.
  • DBSCAN: Density-based, groups points closely packed together, robust to noise/outliers, automatically determines KK.

Ensemble methods combine multiple base models to improve generalization: [nosep]

  • Bagging (Bootstrap Aggregating): Train independent high-variance models (e.g., deep decision trees) on bootstrap samples and average them. It significantly reduces variance without increasing bias. Random Forest improves this by also using random feature subsets at each split to de-correlate trees.
  • Boosting: Sequentially train high-bias models (e.g., shallow trees/stumps), where each new model corrects the errors of the ensemble so far. It primarily reduces bias. Gradient Boosting Machines (XGBoost, LightGBM) fit new trees to the negative gradient of the loss function.

For review, be able to: compute information gain for a split; derive the SVM dual; explain the kernel trick and Representer Theorem; perform PCA via SVD; run kk-means; and analytically compare how bagging and boosting affect bias and variance.

Section summary. Classical ML offers a mathematically grounded toolkit: decision trees (interpretable non-linear splits), SVMs (maximum-margin with kernels for infinite-dimensional feature spaces), PCA (unsupervised dimensionality reduction), and clustering. Ensembles (Random Forests, Gradient Boosting) dominate tabular data by systematically trading off bias and variance.

Statistical Learning: Estimation, Bayes, Causality, and Uncertainty

Core ideas

Statistical estimation: we observe data D={(x(i),y(i))}i=1n\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n i.i.d.\ from P(X,Y)P(X,Y). We estimate parameters θ\boldsymbol{\theta} of a model p(yx;θ)p(y|\mathbf{x};\boldsymbol{\theta}).

Maximum likelihood estimation (MLE):

θ^MLE=argmaxθi=1nlogp(y(i)x(i);θ).\hat{\boldsymbol{\theta}}_{\text{MLE}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)}; \boldsymbol{\theta}).

MLE is consistent (converges to true θ\boldsymbol{\theta}^* as nn\to\infty) and asymptotically efficient (achieves Cram’{e}r—Rao lower bound).

Maximum a posteriori (MAP) adds a prior p(θ)p(\boldsymbol{\theta}):

θ^MAP=argmaxθi=1nlogp(y(i)x(i);θ)+logp(θ).\hat{\boldsymbol{\theta}}_{\text{MAP}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)}; \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}).

L2L^2 regularization on weights corresponds to a Gaussian prior; L1L^1 corresponds to a Laplace prior.

Bayesian inference computes the full posterior:

p(θD)=p(Dθ)p(θ)p(D)p(Dθ)p(θ).p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathcal{D})} \propto p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta}).

Prediction: p(yx,D)=p(yx,θ)p(θD)dθp(y^*|\mathbf{x}^*,\mathcal{D}) = \int p(y^*|\mathbf{x}^*,\boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) d\boldsymbol{\theta} (posterior predictive). For most models the integral is intractable; approximation via MCMC (sampling) or variational inference (optimization).

Causality goes beyond correlation. Key concepts: [nosep]

  • Structural causal models (SCMs): Xi=fi(PAi,Ui)X_i = f_i(\mathbf{PA}_i, U_i) where PAi\mathbf{PA}_i are parents of XiX_i and UiU_i are exogenous noise.
  • Interventions do(Xj=x)do(X_j = x) vs.\ conditioning P(YXj=x)P(Y|X_j=x). The do-operator removes incoming edges to XjX_j.
  • Confounders: ZZ affects both XX and YY; conditioning on ZZ blocks the backdoor path.
  • Randomized controlled trials (RCT) are the gold standard for causal inference.
  • Methods: propensity score matching, instrumental variables, difference-in-differences.

Uncertainty quantification: [nosep]

  • Aleatoric uncertainty: irreducible noise in data (e.g., measurement error). Cannot be reduced with more data.
  • Epistemic uncertainty: uncertainty about model parameters due to limited data. Decreases with more data.
  • Calibration: a model is calibrated if among predictions with confidence pp, the true fraction correct is pp.

For review, be able to: derive MLE for Gaussian and Bernoulli; explain MAP as regularized MLE; describe Bayesian vs.\ frequentist approaches; define confounding; distinguish aleatoric and epistemic uncertainty.

Section summary. Statistical learning provides the probabilistic foundation: MLE for point estimates, Bayesian inference for full uncertainty, and causal reasoning for interventional questions. Understanding estimation properties (bias, consistency, efficiency) is essential for diagnosing models.

Probabilistic ML: Generative Models, Latent Variables, and Graphical Models

Core ideas

Generative vs.\ discriminative models: [nosep]

  • Discriminative: p(yx)p(y|\mathbf{x}), e.g., logistic regression, neural networks.
  • Generative: p(x,y)=p(xy)p(y)p(\mathbf{x}, y) = p(\mathbf{x}|y)p(y), e.g., naive Bayes, GMMs.
  • Generative models can produce new samples xp(x)\mathbf{x} \sim p(\mathbf{x}) and handle missing data naturally.
  • Naive Bayes assumes conditional independence: p(xy)=j=1dp(xjy)p(\mathbf{x}|y) = \prod_{j=1}^d p(x_j|y). Despite the strong assumption, it works well for text classification.

Gaussian mixture models (GMMs):

p(x)=k=1KπkN(xμk,Σk),k=1Kπk=1, πk0.p(\mathbf{x}) = \sum_{k=1}^K \pi_k\,\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \quad \sum_{k=1}^K \pi_k = 1,\ \pi_k \ge 0.

Latent variable zi{1,,K}z_i \in \{1,\dots,K\} indicates which component generated x(i)\mathbf{x}^{(i)}. The log-likelihood is non-convex; use expectation—maximization (EM): [nosep]

  • E-step: compute responsibilities γik=P(zi=kx(i),θ(t))\gamma_{ik} = P(z_i=k|\mathbf{x}^{(i)}, \boldsymbol{\theta}^{(t)}).
  • M-step: update πk=1niγik\pi_k = \frac{1}{n}\sum_i \gamma_{ik}, μk=iγikx(i)iγik\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\mathbf{x}^{(i)}}{\sum_i \gamma_{ik}}, Σk=iγik(x(i)μk)(x(i)μk)iγik\boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik}(\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)(\mathbf{x}^{(i)}-\boldsymbol{\mu}_k)^\top}{\sum_i \gamma_{ik}}.

EM monotonically increases the marginal log-likelihood.

Probabilistic graphical models represent conditional independence: [nosep]

  • Bayesian networks (DAG): p(x)=j=1dp(xjpa(xj))p(\mathbf{x}) = \prod_{j=1}^d p(x_j|\mathbf{pa}(x_j)). Factorization according to graph structure.
  • Markov random fields (undirected): p(x)=1ZCCψC(xC)p(\mathbf{x}) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C) where ZZ is the partition function.
  • Hidden Markov models (HMMs): latent states ztz_t evolve as a Markov chain p(ztzt1)p(z_t|z_{t-1}), observations xtx_t depend on ztz_t. Forward-backward algorithm computes p(ztx1:T)p(z_t|\mathbf{x}_{1:T}); Viterbi finds the most likely sequence.

Latent variable models (LVMs) assume the observed data is generated from unobserved latent variables: [nosep]

  • Factor analysis, probabilistic PCA (x=Wz+ϵ\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\epsilon}, zN(0,I)\mathbf{z} \sim \mathcal{N}(0,\mathbf{I})).
  • Topic models (LDA): documents are mixtures of topics, topics are distributions over words.

Variational inference: approximate intractable posterior p(zx)p(\mathbf{z}|\mathbf{x}) with a simpler q(z)q(\mathbf{z}) by minimizing KL divergence:

KL(qp)=Eq[logq(z)logp(zx)]=ELBO+logp(x).\text{KL}(q\|p) = \mathbb{E}_q[\log q(\mathbf{z}) - \log p(\mathbf{z}|\mathbf{x})] = -\text{ELBO} + \log p(\mathbf{x}).

Maximizing the evidence lower bound (ELBO) L=Eq[logp(x,z)logq(z)]\mathcal{L} = \mathbb{E}_q[\log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z})] is equivalent to minimizing KL.

For review, be able to: derive EM for GMMs; explain the difference between generative and discriminative models; factorize a joint distribution from a DAG; derive the ELBO; and implement the forward-backward algorithm.

Section summary. Probabilistic ML models the full data distribution. GMMs cluster with soft assignments via EM. Graphical models encode structure and enable efficient inference. Variational inference approximates intractable posteriors, forming the foundation for modern deep generative models.

Fundamentals of Deep Learning: MLPs, Backpropagation, and Optimization

Core ideas

Multilayer perceptrons (MLPs) (feedforward neural networks) stack affine transformations with elementwise nonlinearities to learn hierarchical representations. A network of depth LL computes:

h1=σ1(W1x+b1),hL1=σL1(WL1hL2+bL1),y^=WLhL1+bL.\mathbf{h}_1 = \sigma_1(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1), \quad \dots \quad \mathbf{h}_{L-1} = \sigma_{L-1}(\mathbf{W}_{L-1}\mathbf{h}_{L-2} + \mathbf{b}_{L-1}), \quad \hat{\mathbf{y}} = \mathbf{W}_L\mathbf{h}_{L-1} + \mathbf{b}_L.

Hidden units act as learned feature extractors, circumventing the need for manual feature engineering.

Activation functions inject essential nonlinearity (without them, the network collapses to a single affine transform): [nosep]

  • ReLU (Rectified Linear Unit): σ(z)=max(0,z)\sigma(z) = \max(0, z). Piecewise linear, cheap to compute, mitigates vanishing gradients. Default choice for hidden layers.
  • Sigmoid: σ(z)=1/(1+ez)\sigma(z) = 1/(1+e^{-z}). Maps to (0,1)(0,1). Prone to vanishing gradients in saturated regions; mostly used for binary classification outputs.
  • Tanh: tanh(z)=(ezez)/(ez+ez)\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}). Maps to (1,1)(-1,1). Zero-centered, generally preferred over sigmoid for hidden units.
  • Leaky ReLU / GELU: Address the “dying ReLU” problem and provide smooth gradients, common in modern architectures like Transformers.

Backpropagation efficiently computes gradients of the loss L\mathcal{L} with respect to all parameters using the chain rule. It is a specific application of reverse-mode automatic differentiation: [nosep]

  • Forward pass: Compute and cache hidden activations h=f(h1)\mathbf{h}_\ell = f_\ell(\mathbf{h}_{\ell-1}) for =1,,L\ell=1,\dots,L, and the loss L\mathcal{L}.
  • Backward pass: Propagate the error signal backwards. Compute the gradient w.r.t.\ activations: Lh=Lh+1h+1h\frac{\partial\mathcal{L}}{\partial\mathbf{h}_\ell} = \frac{\partial\mathcal{L}}{\partial\mathbf{h}_{\ell+1}} \frac{\partial\mathbf{h}_{\ell+1}}{\partial\mathbf{h}_\ell}.
  • Parameter gradients: LW=LhhW\frac{\partial\mathcal{L}}{\partial\mathbf{W}_\ell} = \frac{\partial\mathcal{L}}{\partial\mathbf{h}_\ell} \frac{\partial\mathbf{h}_\ell}{\partial\mathbf{W}_\ell}.

Optimization algorithms train these highly non-convex objectives: [nosep]

  • SGD (Stochastic Gradient Descent): wt+1=wtηtwLB(wt)\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}} \mathcal{L}_B(\mathbf{w}_t), using a minibatch BB to estimate the true gradient.
  • Momentum: Accumulates an exponentially decaying moving average of past gradients to dampen oscillations and accelerate convergence across flat regions.
  • Adam: Adaptive learning rates per parameter, using running averages of the first moment (mean) and second moment (uncentered variance) of the gradients:
mt=β1mt1+(1β1)gt,vt=β2vt1+(1β2)gt2,θt+1=θtηm^tv^t+ϵ,m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2, \quad \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates.

Numerical considerations & Regularization: [nosep]

  • Vanishing/exploding gradients: Deep networks repeatedly multiply weight matrices, causing gradients to vanish or explode. Handled by careful initialization (Xavier/Glorot, He) and skip connections.
  • Batch Normalization: Standardizes activations within a minibatch to zero mean and unit variance, smoothing the optimization landscape and allowing higher learning rates.
  • Dropout: Randomly zeroes out neurons during training with probability pp. Acts as a regularizer by preventing feature co-adaptation, implicitly training an ensemble.

Universal Approximation Theorem: A feedforward network with at least one hidden layer of sufficient width and a non-polynomial activation function can approximate any continuous function on a compact subset of Rn\mathbb{R}^n to arbitrary precision. However, it does not guarantee that the learning algorithm can find those weights, nor that the network will generalize well.

For review, be able to: derive backpropagation for a 2-layer MLP; explain why nonlinearities are necessary; implement SGD with momentum; diagnose vanishing gradients; and describe batch normalization.

Section summary. Deep learning stacks differentiable nonlinear layers to learn representations. Backpropagation leverages the chain rule for exact gradient computation. Due to non-convexity and high dimensionality, specialized optimizers (Adam, SGD+Momentum), careful initialization, and techniques like Batch Normalization and Dropout are vital for successful training.

Modern Deep Learning: CNNs, RNNs, Attention, and Transformers

Core ideas

Convolutional Neural Networks (CNNs) exploit the inherent spatial structure of data (e.g., images) through localized filter operations. [nosep]

  • Definition (2D Convolution): Given an image II and a filter (kernel) KK, the discrete convolution is defined as (IK)(i,j)=mnI(i+m,j+n)K(m,n)(I * K)(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) K(m,n). This operation shares weights across all spatial locations, a property known as weight tying.
  • Key Properties: [nosep]
  • Translation Equivariance: Shifting the input shifts the output by the same amount, i.e., f(g(x))=g(f(x))f(g(x)) = g(f(x)) where gg is a translation operator.
  • Local Connectivity: Each neuron only connects to a small local region of the input, defining its receptive field.
  • Pooling: Operations like max or average pooling aggregate local features, providing spatial downsampling and local translation invariance (robustness to small shifts).
  • Canonical Architectures: [nosep]
  • AlexNet: Pioneered deep CNNs using ReLU activations, dropout, and extensive data augmentation.
  • VGG: Standardized architecture using deep stacks of small 3×33 \times 3 convolutions.
  • ResNet: Introduced residual (skip) connections y=F(x)+xy = \mathcal{F}(x) + x to mitigate vanishing gradients in extremely deep networks.

Recurrent Neural Networks (RNNs) are designed to model sequential data by maintaining a hidden state across time steps:

ht=σ(Whhht1+Whxxt+bh),y^t=Whyht+by.\mathbf{h}_t = \sigma(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{b}_h), \quad \hat{\mathbf{y}}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y.

[nosep]

  • Backpropagation Through Time (BPTT): Unrolls the RNN computation graph over time to compute gradients. A canonical failure mode of standard RNNs is the vanishing/exploding gradient problem over long sequences due to repeated multiplication of the weight matrix Whh\mathbf{W}_{hh}.

Gated RNNs address long-range dependencies by regulating information flow: [nosep]

  • Long Short-Term Memory (LSTM): Introduces a cell state ct\mathbf{c}_t regulated by forget (ft\mathbf{f}_t), input (it\mathbf{i}_t), and output (ot\mathbf{o}_t) gates:
ft=σ(Wf[ht1,xt]+bf),it=σ(Wi[ht1,xt]+bi),c~t=tanh(Wc[ht1,xt]+bc),\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1},\mathbf{x}_t] + \mathbf{b}_f),\quad \mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1},\mathbf{x}_t] + \mathbf{b}_i),\quad \tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1},\mathbf{x}_t] + \mathbf{b}_c), ct=ftct1+itc~t,ot=σ(Wo[ht1,xt]+bo),ht=ottanh(ct).\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,\quad \mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1},\mathbf{x}_t] + \mathbf{b}_o),\quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t).
  • Gated Recurrent Unit (GRU): A simplified gated architecture merging the forget and input gates into a single update gate.

Attention Mechanisms overcome the bottleneck of compressing entire sequences into a fixed-length vector by allowing models to dynamically weigh the importance of all input elements. [nosep]

  • Definition (Scaled Dot-Product Attention): For queries Q\mathbf{Q}, keys K\mathbf{K}, and values V\mathbf{V}, attention is computed as:
Attention(Q,K,V)=softmax ⁣(QKdk)V.\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}.
Scaling by $1/\sqrt{d_k}$ prevents the dot products from growing excessively large, which would push the softmax function into regions with near-zero gradients.

Transformers (Vaswani et al., 2017) rely entirely on attention, dispensing with recurrence and convolutions. [nosep]

  • Multi-Head Attention: Projects Q\mathbf{Q}, K\mathbf{K}, and V\mathbf{V} into hh different representation subspaces (heads) in parallel. The independent attention outputs are then concatenated and linearly transformed.
  • Positional Encoding: Since attention is permutation-invariant, positional encodings (e.g., fixed sinusoidal functions or learned embeddings) are added to input embeddings to inject sequence order information.
  • Architecture Components: Comprises layer normalization, residual connections, and position-wise feedforward networks. The encoder uses full self-attention, whereas the decoder uses masked self-attention to prevent attending to future tokens (preserving the autoregressive property).
  • Complexity: Standard attention has O(n2)O(n^2) time and space complexity with respect to sequence length nn.

Large-Scale Pretraining leverages massive unannotated datasets to learn general representations: [nosep]

  • BERT: Uses a bidirectional encoder pretrained via Masked Language Modeling (MLM).
  • GPT: Employs a causal, autoregressive decoder architecture.
  • Scaling Laws: Empirical observations that model performance scales power-law predictably with parameter count, dataset size, and training compute.

For review, be able to: explain the receptive field of a CNN; derive gradient flow in an RNN via BPTT; implement scaled dot-product attention; draw the standard Transformer encoder/decoder blocks; and describe masked language modeling.

Section summary. Modern deep learning is built upon specialized architectures for data modalities: CNNs utilize local connectivity and weight tying for spatial data; RNNs and LSTMs maintain hidden states for sequential data; and Transformers leverage self-attention for parallelizable, long-range sequence modeling. The paradigm has shifted strongly toward large-scale, self-supervised pretraining followed by task-specific fine-tuning.

Generative AI: VAEs, GANs, Diffusion, and Self-Supervised Learning

Core ideas

Variational Autoencoders (VAEs) combine deep learning with variational inference to learn continuous, structured latent spaces: [nosep]

  • Definition: An architecture consisting of a probabilistic encoder qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x}) mapping data to a latent distribution (typically Gaussian N(μϕ(x),σϕ2(x)I)\mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\sigma}_\phi^2(\mathbf{x})\mathbf{I})), and a decoder pθ(xz)p_\theta(\mathbf{x}|\mathbf{z}) reconstructing the data.
  • ELBO Objective: Because exact marginal likelihood is intractable, VAEs maximize the Evidence Lower Bound (ELBO):
logp(x)L(θ,ϕ)=Eqϕ(zx)[logpθ(xz)]KL(qϕ(zx)p(z)).\log p(\mathbf{x}) \ge \mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})).
The first term minimizes reconstruction error, while the second acts as a regularizer, pulling the approximate posterior toward the prior $p(\mathbf{z}) = \mathcal{N}(0,\mathbf{I})$.
  • Reparameterization Trick: To allow gradients to backpropagate through the stochastic sampling step, z\mathbf{z} is reparameterized: sample ϵN(0,I)\boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I}), then compute z=μϕ(x)+σϕ(x)ϵ\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}.

Generative Adversarial Networks (GANs) frame generation as a two-player, zero-sum game: [nosep]

  • Definition: A generator network GθG_\theta maps noise zp(z)\mathbf{z} \sim p(\mathbf{z}) to fake data x~\tilde{\mathbf{x}}, while a discriminator DϕD_\phi predicts whether a sample is real (x\mathbf{x}) or fake (x~\tilde{\mathbf{x}}).
  • Minimax Objective:
minθmaxϕExpdata[logDϕ(x)]+Ezp(z)[log(1Dϕ(Gθ(z)))].\min_\theta \max_\phi \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D_\phi(G_\theta(\mathbf{z})))].
  • Training Pathologies: GANs are notoriously difficult to train, frequently suffering from mode collapse (the generator produces only a limited variety of outputs, ignoring parts of the data distribution) and vanishing gradients.
  • Improvements: Wasserstein GAN (WGAN) addresses instability by minimizing the Earth Mover’s distance using a 1-Lipschitz discriminator constrained via gradient penalty. StyleGAN introduces mapping networks and adaptive instance normalization for high-fidelity synthesis.

Diffusion Models (and Score-Based Generative Models) generate data by learning to reverse a gradual noising process: [nosep]

  • Forward Process (Diffusion): A Markov chain gradually corrupts data x0\mathbf{x}_0 by adding Gaussian noise over TT steps: q(xtxt1)=N(1βtxt1,βtI)q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I}).
  • Reverse Process (Denoising): A neural network learns the reverse transitions pθ(xt1xt)=N(μθ(xt,t),Σθ(xt,t))p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{x}_t,t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t)) to synthesize data from pure noise xTN(0,I)\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I}).
  • Training Objective: DDPM minimizes a reweighted variational bound, effectively training a U-Net to predict the injected noise ϵθ(xt,t)\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t):
L=Et,x0,ϵ[ϵϵθ(αˉtx0+1αˉtϵ,t)2].\mathcal{L} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2].
  • Score Matching Connection: The noise prediction objective is mathematically equivalent to score matching, learning the gradient of the log data density: xtlogp(xt)ϵθ(xt,t)/1αˉt\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \approx -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)/\sqrt{1-\bar{\alpha}_t}. Latent Diffusion Models (e.g., Stable Diffusion) perform this process efficiently inside a frozen VAE latent space.

Self-Supervised Learning (SSL) extracts powerful representations from unannotated data by inventing pretext tasks: [nosep]

  • Contrastive Learning: Pulls representations of positive pairs (e.g., augmented views of the same image) closer while pushing negative pairs apart. SimCLR uses the InfoNCE loss:
L=logexp(sim(zi,zj)/τ)kiexp(sim(zi,zk)/τ),\mathcal{L} = -\log \frac{\exp(\text{sim}(\mathbf{z}_i,\mathbf{z}_j)/\tau)}{\sum_{k\neq i} \exp(\text{sim}(\mathbf{z}_i,\mathbf{z}_k)/\tau)},
where $\tau$ is a temperature hyperparameter controlling the penalty strength on hard negatives.
  • Non-contrastive Methods: Architectures like BYOL and SimSiam use momentum encoders and stop-gradient operations to prevent representation collapse without requiring negative pairs.
  • Masked Autoencoders: Models like BERT (text) and MAE (vision) mask patches of input data and train a network to reconstruct the missing content, enforcing deep contextual understanding.

For review, be able to: mathematically derive the VAE ELBO from Jensen’s inequality; explain the necessity of the reparameterization trick; describe the GAN minimax game and optimal discriminator; outline the diffusion forward/reverse Markov chain; and define the InfoNCE contrastive loss.

Section summary. Generative AI models data distributions. VAEs provide stable training and smooth latent spaces via variational inference. GANs employ an adversarial minimax game to achieve high-fidelity generation. Diffusion models use iterative denoising equivalent to score matching, currently yielding state-of-the-art visual quality. Self-supervised learning (contrastive or masked) converts raw data into informative latent structures, serving as the backbone for modern foundation models.

Reinforcement Learning: MDPs, TD Learning, Q-Learning, and Policy Gradients

Core ideas

Markov Decision Process (MDP) formalizes sequential decision-making under uncertainty: [nosep]

  • Definition: A tuple (S,A,P,R,γ)(\mathcal{S}, \mathcal{A}, P, R, \gamma) where S\mathcal{S} is the state space, A\mathcal{A} is the action space, P(ss,a)P(\mathbf{s}'|\mathbf{s}, a) is the transition probability, R(s,a,s)R(\mathbf{s}, a, \mathbf{s}') is the reward function, and γ[0,1]\gamma \in [0,1] is the discount factor.
  • Markov Property: The future depends only on the present state, not on the sequence of events that preceded it: P(st+1st,at,,s0,a0)=P(st+1st,at)P(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t, \ldots, \mathbf{s}_0, \mathbf{a}_0) = P(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t).
  • Return: The cumulative discounted reward from time tt: Gt=k=0γkRt+k+1G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}.
  • **Policy \pi(a|\mathbf{s**):} A mapping from states to a probability distribution over actions.
  • Value Functions: [nosep]
  • State-value: Vπ(s)=Eπ[GtSt=s]V^\pi(\mathbf{s}) = \mathbb{E}_\pi[G_t | \mathbf{S}_t = \mathbf{s}].
  • Action-value: Qπ(s,a)=Eπ[GtSt=s,At=a]Q^\pi(\mathbf{s}, a) = \mathbb{E}_\pi[G_t | \mathbf{S}_t = \mathbf{s}, A_t = a].
  • Bellman Equations: Recursive relationships linking the value of a state to the values of its successors:
Vπ(s)=aπ(as)sP(ss,a)[R(s,a,s)+γVπ(s)].V^\pi(\mathbf{s}) = \sum_a \pi(a|\mathbf{s}) \sum_{\mathbf{s}'} P(\mathbf{s}'|\mathbf{s}, a)[R(\mathbf{s},a,\mathbf{s}') + \gamma V^\pi(\mathbf{s}')].
  • Bellman Optimality: The optimal policy π\pi^* satisfies V(s)=maxaQ(s,a)V^*(\mathbf{s}) = \max_a Q^*(\mathbf{s}, a).

Dynamic Programming solves known MDPs exactly: [nosep]

  • Policy Iteration: Alternate between evaluating VπV^\pi and greedily improving π(as)=1[a=argmaxaQπ(s,a)]\pi'(a|\mathbf{s}) = \mathbb{1}[a = \arg\max_a Q^\pi(\mathbf{s},a)].
  • Value Iteration: Iteratively apply the Bellman optimality operator until convergence.

Model-Free RL learns optimal policies from environmental interaction without known PP or RR.

Temporal Difference (TD) Learning blends Monte Carlo methods and dynamic programming: [nosep]

  • Mechanism: Bootstraps updates using current value estimates instead of waiting for episode completion.
  • TD(0) Update: V(st)V(st)+αδtV(\mathbf{s}_t) \leftarrow V(\mathbf{s}_t) + \alpha\delta_t, where δt=rt+1+γV(st+1)TD TargetV(st)\delta_t = \underbrace{r_{t+1} + \gamma V(\mathbf{s}_{t+1})}_{\text{TD Target}} - V(\mathbf{s}_t) is the TD Error.

Q-Learning and SARSA are canonical TD control algorithms: [nosep]

  • SARSA (On-Policy): Updates QQ using the action actually taken by the behavior policy:
Q(st,at)Q(st,at)+α[rt+1+γQ(st+1,at+1)Q(st,at)].Q(\mathbf{s}_t, a_t) \leftarrow Q(\mathbf{s}_t, a_t) + \alpha[r_{t+1} + \gamma Q(\mathbf{s}_{t+1}, a_{t+1}) - Q(\mathbf{s}_t, a_t)].
  • Q-Learning (Off-Policy): Updates QQ using the maximum possible value of the next state, uncoupling learning from the behavior policy:
Q(st,at)Q(st,at)+α[rt+1+γmaxaQ(st+1,a)Q(st,at)].Q(\mathbf{s}_t, a_t) \leftarrow Q(\mathbf{s}_t, a_t) + \alpha[r_{t+1} + \gamma \max_{a'} Q(\mathbf{s}_{t+1}, a') - Q(\mathbf{s}_t, a_t)].
  • Deep Q-Network (DQN): Stabilizes Q-learning with neural networks via experience replay (breaking temporal correlations) and target networks (providing fixed bootstrap targets).

Policy Gradient Methods directly parameterize and optimize the policy πθ\pi_\theta: [nosep]

  • Policy Gradient Theorem: Provides an analytic gradient for the expected return:
θJ(θ)=Eπθ[θlogπθ(as)Qπθ(s,a)].\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(a|\mathbf{s}) \cdot Q^{\pi_\theta}(\mathbf{s},a) ].
  • REINFORCE: Uses sample returns GtG_t as an unbiased (but high-variance) estimator of QQ.
  • Actor-Critic: Combines policy gradients (the actor) with a learned value function (the critic) serving as a baseline to reduce variance. Subtracting the state-value yields the advantage function A(s,a)=Q(s,a)V(s)A(\mathbf{s},a) = Q(\mathbf{s},a) - V(\mathbf{s}).
  • PPO (Proximal Policy Optimization): Clips the policy ratio rt(θ)=πθ(as)πold(as)r_t(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)} to prevent destructively large policy updates.

Exploration vs. Exploitation Tradeoff: Algorithms must balance exploiting known rewards with exploring undiscovered states using strategies like ϵ\epsilon-greedy, Boltzmann exploration, or Upper Confidence Bound (UCB).

For review, be able to: define the Markov Property; mathematically state the Bellman Optimality Equation; prove why Q-learning is off-policy while SARSA is on-policy; derive the REINFORCE update rule; and explain how Actor-Critic reduces variance.

Section summary. Reinforcement learning solves sequential decision-making governed by MDPs. Value-based methods (Q-Learning) learn optimal action-values off-policy. Policy gradient methods directly optimize a parameterized policy, overcoming limitations in continuous action spaces. Deep RL merges these frameworks with neural network function approximators (e.g., DQN, PPO), relying heavily on mechanisms like advantage estimation to stabilize learning.

Practice and Ethics: MLOps, Fairness, Explainability, and Safety

Core ideas

Machine Learning Operations (MLOps) standardizes the transition from experimental models to reliable, production-grade systems: [nosep]

  • Data Pipeline & Versioning: Tracks dataset lineage (e.g., DVC), standardizes feature engineering via feature stores, and automates data quality monitoring.
  • Experiment Tracking & CI/CD: Systematically logs hyperparameters, metrics, and artifacts (e.g., MLflow). Implements continuous integration for data validation and continuous deployment for model updates.
  • Monitoring in Production: Detects degradation over time, primarily distinguishing between data drift (covariate shift: P(x)P(\mathbf{x}) changes while P(yx)P(y|\mathbf{x}) remains constant) and concept drift (the underlying relationship P(yx)P(y|\mathbf{x}) changes, requiring retraining).

Fairness in ML systematically measures and mitigates discriminatory algorithmic behaviors: [nosep]

  • Sources of Bias: Encompasses historical bias, representation bias (under-sampling minority groups), and measurement bias (using flawed proxies for target variables).
  • Group Fairness Notions: For a sensitive attribute A{a,b}A \in \{a,b\}, target yy, and prediction y^\hat{y}: [nosep]
  • Demographic Parity: The model’s positive prediction rate is independent of the sensitive attribute: P(y^=1A=a)=P(y^=1A=b)P(\hat{y}=1|A=a) = P(\hat{y}=1|A=b).
  • Equalized Odds: Both the True Positive Rate (TPR) and False Positive Rate (FPR) are identical across groups: P(y^=1y=c,A=a)=P(y^=1y=c,A=b)P(\hat{y}=1|y=c, A=a) = P(\hat{y}=1|y=c, A=b) for c{0,1}c \in \{0,1\}.
  • Equal Opportunity: Requires equality only for the TPR (y=1y=1).
  • Impossibility Theorem (Chouldechova et al.): It is mathematically impossible to simultaneously satisfy demographic parity and equalized odds unless the base rates of the groups are exactly equal or the predictor is perfectly accurate.
  • Mitigation: Involves pre-processing (reweighing data), in-processing (adversarial debiasing, constrained optimization), or post-processing (group-specific thresholding).

Explainability and Interpretability open the “black box” of complex models: [nosep]

  • Intrinsic Interpretability: Using transparent model classes natively (e.g., shallow decision trees, sparse linear models).
  • Post-Hoc Feature Attribution: [nosep]
  • SHAP (SHapley Additive exPlanations): Grounded in cooperative game theory, it uniquely satisfies desirable axioms (local accuracy, missingness, consistency) to assign a marginal contribution to each feature.
  • LIME (Local Interpretable Model-agnostic Explanations): Fits a simple, interpretable surrogate model (e.g., linear regression) in the local neighborhood of a specific prediction.
  • Saliency & Counterfactuals: Saliency maps visualize gradients w.r.t.\ input pixels (in vision). Counterfactual explanations highlight the minimal perturbation required to alter the model’s decision.

AI Safety, Robustness, and Privacy ensure models perform reliably under stress and adversarial conditions: [nosep]

  • Adversarial Examples: Microscopic, carefully crafted perturbations δ\delta that drastically flip model predictions. [nosep]
  • Fast Gradient Sign Method (FGSM): A single-step attack: δ=ϵsign(xL(θ,x,y))\delta = \epsilon \cdot \text{sign}(\nabla_{\mathbf{x}} \mathcal{L}(\theta, \mathbf{x}, y)).
  • Projected Gradient Descent (PGD): An iterative, multi-step application of FGSM representing a strong first-order attack.
  • AI Alignment: The challenge of ensuring systems pursue intended human goals without exhibiting reward hacking (gaming the objective) or negative side-effects.
  • Differential Privacy (DP): Provides a mathematical guarantee that the inclusion or removal of a single data point does not significantly alter the output distribution of a randomized algorithm M\mathcal{M}. [nosep]
  • Formal Definition (ϵ\epsilon-DP): P(M(D)S)eϵP(M(D)S)P(\mathcal{M}(D) \in S) \le e^\epsilon P(\mathcal{M}(D') \in S) for all neighboring datasets D,DD, D' differing by one record, bounding the privacy loss by ϵ\epsilon.
  • Federated Learning: Enables decentralized model training without aggregating raw data centrally, often combined with secure multi-party computation.

For review, be able to: describe the MLOps lifecycle; rigorously define demographic parity and equalized odds; compute SHAP values for a minimal cooperative game; derive adversarial examples via FGSM; and explain the formal bounds of differential privacy.

Section summary. The deployment of machine learning is constrained by practical and ethical requirements. MLOps ensures rigorous, automated pipelines for continuous monitoring and deployment. Mathematical fairness guarantees audit and mitigate algorithmic bias. Explainability methods like SHAP and LIME provide interpretability for black-box models. AI Safety enforces robustness against adversarial attacks and distribution shifts, while Differential Privacy and Federated Learning protect sensitive data.