MACHINE LEARNING ALGORITHMS

Mathematical Foundations & Implementation Details

LINEAR MODELS
Linear Regression
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \mathbf{x}^T \boldsymbol{\beta}$$
ŷ = predicted value
β₀ = intercept (bias term)
βᵢ = coefficient for feature i
x = feature vector (augmented with a leading 1 for the intercept)
X = feature matrix [n×p]
Cost Function (MSE)
$$J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$$
Minimize using Normal Equation
$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Or Gradient Descent
$$\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \frac{1}{m} \mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})$$
O(nd²+d³) or O(nkd) training O(d) prediction
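A minimal NumPy sketch of the two fitting routes above, the normal equation and batch gradient descent on the MSE cost. The data, seed, learning rate, and iteration count are illustrative choices only; in practice XᵀX can be ill-conditioned, so a pseudo-inverse or ridge penalty is often preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.c_[np.ones(n), rng.normal(size=(n, d))]   # prepend intercept column
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equation: beta = (X^T X)^{-1} X^T y (solved, not explicitly inverted)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: beta := beta - alpha * (1/m) X^T (X beta - y)
beta_gd = np.zeros(X.shape[1])
alpha = 0.1
for _ in range(2000):
    beta_gd -= alpha / n * X.T @ (X @ beta_gd - y)

print(np.round(beta_ne, 2), np.round(beta_gd, 2))  # both approach beta_true
```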
Logistic Regression
$$P(y=1|\mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \boldsymbol{\beta}^T\mathbf{x}$$
σ(z) = sigmoid function
z = linear combination βᵀx
Output: probability ∈ (0,1)
Negative Log-Likelihood (Cross-Entropy) Cost
$$J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)}))]$$
Gradient (no closed form solution)
$$\nabla J = \frac{1}{m} \mathbf{X}^T(\sigma(\mathbf{X}\boldsymbol{\beta}) - \mathbf{y})$$
Update rule
$$\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \nabla J$$
O(nkd) training O(d) prediction
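A short sketch of logistic regression trained by gradient descent on the cross-entropy cost above. The synthetic data, learning rate, and iteration count are arbitrary, and no regularization is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d = 200, 2
X = np.c_[np.ones(n), rng.normal(size=(n, d))]        # intercept column + features
beta_true = np.array([-0.5, 2.0, -3.0])
y = (sigmoid(X @ beta_true) > rng.uniform(size=n)).astype(float)

beta = np.zeros(d + 1)
alpha = 0.5
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ beta) - y) / n           # gradient of J
    beta -= alpha * grad                               # update rule

p = sigmoid(X @ beta)                                  # predicted P(y=1|x)
print(np.round(beta, 2))
```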
TREE-BASED MODELS
Decision Tree
$$\text{Information Gain} = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$
H(S) = entropy of set S
Sᵥ = subset of S with value v
Entropy & Gini Impurity
$$\text{Entropy: } H(S) = -\sum_{i} p_i \log_2(p_i)$$
$$\text{Gini: } G(S) = 1 - \sum_{i} p_i^2$$
Split criterion
$$\text{Best split} = \arg\max(\text{Information Gain})$$
O(dn log n) training O(log n) prediction
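A sketch of the split criterion only: entropy, Gini impurity, and information gain for one candidate binary split. A full tree applies this recursively over all candidate splits; the labels and split point here are made-up toy values.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # H(S) - sum_v |S_v|/|S| * H(S_v) for a two-way split
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

S = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1])
left, right = S[:6], S[6:]          # one candidate split of S
print(entropy(S), gini(S), information_gain(S, left, right))
```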
Random Forest
$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}) \quad \text{(regression)} \quad \text{or} \quad \text{mode}\{T_1(\mathbf{x}),\ldots,T_B(\mathbf{x})\} \quad \text{(classification)}$$
B = number of trees
Tᵦ = b-th tree trained on bootstrap sample
Bootstrap Sampling
Sample with replacement: |bootstrap| = |original|
Feature bagging (mtry parameter)
At each split: randomly select √p features (classification)
At each split: randomly select p/3 features (regression)
O(B×dn log n) training O(B×log n) prediction
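A compact bagging sketch using shallow scikit-learn trees as base learners. It simplifies the mtry rule above by subsampling √p features once per tree rather than at every split; forest size, tree depth, and data are arbitrary illustration choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

B, mtry = 25, int(np.sqrt(X.shape[1]))             # sqrt(p) features per tree
trees, feats = [], []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample (with replacement)
    cols = rng.choice(X.shape[1], size=mtry, replace=False)
    t = DecisionTreeClassifier(max_depth=3).fit(X[idx][:, cols], y[idx])
    trees.append(t)
    feats.append(cols)

# Majority vote across the B trees
votes = np.stack([t.predict(X[:, c]) for t, c in zip(trees, feats)])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```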
SUPPORT VECTOR MACHINES
SVM (Hard Margin)
$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to: } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
w = weight vector (normal to hyperplane)
b = bias term
Margin = 2/||w||
Lagrangian Dual
$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i} \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1]$$
KKT conditions lead to
$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \quad \text{(support vectors only)}$$
$$\max_{\boldsymbol{\alpha}} \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to: } \alpha_i \geq 0, \; \sum_{i} \alpha_i y_i = 0$$
O(n²) to O(n³) training O(sv×d) prediction
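A sketch verifying the KKT identity w = Σᵢ αᵢyᵢxᵢ on a linearly separable toy problem. It uses scikit-learn's SVC (a very large C approximates the hard margin); in SVC, dual_coef_ holds αᵢyᵢ for the support vectors. The data and C value are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.r_[rng.normal(size=(20, 2)) + 3, rng.normal(size=(20, 2)) - 3]
y = np.r_[np.ones(20), -np.ones(20)]

svm = SVC(kernel="linear", C=1e6).fit(X, y)              # C -> inf ~ hard margin
w_from_dual = svm.dual_coef_ @ svm.support_vectors_      # sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, svm.coef_))               # matches the primal weights
print("margin width:", 2 / np.linalg.norm(svm.coef_))    # margin = 2 / ||w||
```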
SVM (Soft Margin + Kernel)
$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i \quad \text{subject to: } y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$$
C = regularization parameter
ξᵢ = slack variables
φ(x) = feature mapping to higher dimension
Kernel Trick
$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$
Common kernels:
$$\text{RBF: } K(\mathbf{x},\mathbf{x}') = \exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$$
$$\text{Polynomial: } K(\mathbf{x},\mathbf{x}') = (\gamma\mathbf{x}^T\mathbf{x}' + r)^d$$
$$\text{Decision: } f(\mathbf{x}) = \text{sign}\left(\sum_{i} \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b\right)$$
O(n²) to O(n³) training O(sv×d) prediction
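A sketch of the kernel trick: the decision value of an RBF-kernel SVM recomputed by hand from its support vectors, matching f(x) = Σᵢ αᵢyᵢK(xᵢ,x) + b (scikit-learn's dual_coef_ stores αᵢyᵢ). The circular toy data, γ, and C are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int) * 2 - 1   # circular boundary

gamma, C = 0.5, 1.0
svm = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

def rbf(A, B, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, svm.support_vectors_, gamma)                      # K(x, x_i)
f_manual = K @ svm.dual_coef_.ravel() + svm.intercept_       # sum alpha_i y_i K + b
print(np.allclose(f_manual, svm.decision_function(X)))       # True
```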
CLUSTERING ALGORITHMS
K-Means
$$J = \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_{c_i}\|^2 \quad \text{where } c_i \in \{1,\ldots,k\}$$
μcᵢ = centroid of cluster cᵢ
cᵢ = cluster assignment for point i
Lloyd's Algorithm
1. Initialize k centroids randomly
2. Assign points to nearest centroid
$$c^{(i)} := \arg\min_j \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_j\|^2$$
3. Update centroids
$$\boldsymbol{\mu}_j := \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}^{(i)}$$
4. Repeat until convergence
O(nkdi) training O(kd) prediction
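A minimal NumPy implementation of Lloyd's algorithm as listed above, with plain random initialization (real implementations typically use k-means++ and several restarts). The two-blob data is synthetic, and the sketch assumes no cluster is left empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]        # 1. init centroids
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        c = d2.argmin(axis=1)                                # 2. assign to nearest centroid
        new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])  # 3. update centroids
        if np.allclose(new_mu, mu):                          # 4. stop at convergence
            break
        mu = new_mu
    return c, mu

rng = np.random.default_rng(5)
X = np.r_[rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5]
labels, centroids = kmeans(X, k=2)
print(centroids)
```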
Gaussian Mixture Model
$$p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$$
$$\text{where } \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
πⱼ = mixing coefficient (prior)
μⱼ = mean of component j
Σⱼ = covariance matrix of component j
EM Algorithm
E-step: Compute responsibilities
$$\gamma(z_{ij}) = \frac{\pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_l,\boldsymbol{\Sigma}_l)}$$
M-step: Update parameters (with $N_j = \sum_{i=1}^{n} \gamma(z_{ij})$ and $\pi_j = N_j/n$)
$$\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma(z_{ij}) \mathbf{x}_i$$
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma(z_{ij}) (\mathbf{x}_i-\boldsymbol{\mu}_j)(\mathbf{x}_i-\boldsymbol{\mu}_j)^T$$
O(nkd²i) training O(kd²) prediction
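The E- and M-steps above written directly in NumPy, with SciPy supplying the Gaussian densities. Initialization, iteration count, and data are arbitrary; a practical implementation would add covariance regularization and a log-likelihood stopping rule.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
X = np.r_[rng.normal(size=(150, 2)), rng.normal(size=(150, 2)) + 4]
n, d, k = X.shape[0], X.shape[1], 2

pi = np.full(k, 1 / k)
mu = X[rng.choice(n, size=k, replace=False)]
Sigma = np.stack([np.eye(d)] * k)

for _ in range(50):
    # E-step: responsibilities gamma[i, j]
    dens = np.stack([multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: update pi, mu, Sigma
    N = gamma.sum(axis=0)
    pi = N / n
    mu = (gamma.T @ X) / N[:, None]
    for j in range(k):
        Xc = X - mu[j]
        Sigma[j] = (gamma[:, j, None] * Xc).T @ Xc / N[j]

print(np.round(mu, 2))
```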
NEURAL NETWORKS
Multi-Layer Perceptron
$$\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) = \sigma(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$
a⁽ˡ⁾ = activations at layer l
W⁽ˡ⁾ = weight matrix [nₗ × nₗ₋₁]
σ = activation function (ReLU, sigmoid, tanh)
Backpropagation
Forward pass: compute activations
Backward pass: compute gradients
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} C \odot \sigma'(\mathbf{z}^{(L)})$$
$$\boldsymbol{\delta}^{(l)} = ((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}) \odot \sigma'(\mathbf{z}^{(l)})$$
$$\frac{\partial C}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T, \qquad \frac{\partial C}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
O(epochs×n×weights) training O(weights) prediction
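A two-layer MLP trained on XOR using exactly the backprop equations above, with sigmoid activations and a squared-error cost. Hidden size, learning rate, and iteration count are illustrative; real networks rely on autodiff frameworks instead of hand-written gradients.

```python
import numpy as np

def sigma(z):  return 1 / (1 + np.exp(-z))
def dsigma(z): return sigma(z) * (1 - sigma(z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T   # shape (2, 4)
Y = np.array([[0, 1, 1, 0]], dtype=float)                       # shape (1, 4)

rng = np.random.default_rng(7)
W1, b1 = rng.normal(size=(8, 2)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)), np.zeros((1, 1))
alpha, m = 1.0, X.shape[1]

for _ in range(10000):
    # forward pass
    z1 = W1 @ X + b1;  a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    # backward pass: delta^(L), then delta^(l)
    d2 = (a2 - Y) * dsigma(z2)
    d1 = (W2.T @ d2) * dsigma(z1)
    # gradient step: dC/dW = delta a^(l-1)^T, dC/db = delta
    W2 -= alpha * d2 @ a1.T / m;  b2 -= alpha * d2.mean(axis=1, keepdims=True)
    W1 -= alpha * d1 @ X.T / m;   b1 -= alpha * d1.mean(axis=1, keepdims=True)

# network outputs for the four XOR inputs (should approach [0, 1, 1, 0])
print(np.round(sigma(W2 @ sigma(W1 @ X + b1) + b2), 2))
```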
Convolutional Neural Network
$$(f * g)[m,n] = \sum_{i} \sum_{j} f[i,j] \cdot g[m-i, n-j]$$
f = input feature map
g = kernel/filter
Output size = (W-F+2P)/S + 1 (W = input width, F = filter size, P = padding, S = stride)
Convolution Layer
Multiple filters create feature maps
$$Y[i,j,d] = \sum_{u} \sum_{v} \sum_{c} W[u,v,c,d] \cdot X[i+u,j+v,c] + b[d]$$
Pooling reduces spatial dimensions
$$\text{Max pooling: } Y[i,j] = \max(X[si:si+f, sj:sj+f])$$
Parameter sharing: same filter across image
O(batch×H×W×filters×kernel²) O(parameters)
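A direct NumPy sketch of a single-channel, single-filter "valid" convolution layer, plus the output-size formula. Note that deep-learning libraries implement convolution as cross-correlation (no kernel flip), which is what this loop does; the toy input and filter are arbitrary.

```python
import numpy as np

def conv2d_valid(X, W):
    H, Wd = X.shape
    F = W.shape[0]
    out = np.zeros((H - F + 1, Wd - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + F, j:j + F] * W)   # slide filter over the input
    return out

X = np.arange(36, dtype=float).reshape(6, 6)
W = np.array([[1.0, 0.0], [0.0, -1.0]])                   # 2x2 edge-like filter
Y = conv2d_valid(X, W)

# Output size = (W - F + 2P)/S + 1 with W=6, F=2, P=0, S=1  ->  5
print(Y.shape)   # (5, 5)
```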
ALGORITHM COMPLEXITY COMPARISON
Algorithm | Training Time | Training Space | Prediction Time | Parameters | Assumptions
Linear Regression | O(nd²+d³) or O(nkd) | O(nd) | O(d) | d+1 | Linear relationship
Logistic Regression | O(nkd) | O(nd) | O(d) | d+1 | Linear decision boundary
SVM | O(n²) to O(n³) | O(n²) | O(sv×d) | ≤n (support vectors) | Large margin separation
K-Means | O(nkdi) | O(nd+kd) | O(kd) | kd | Spherical clusters
Random Forest | O(B×n log n×d) | O(B×n) | O(B×log n) | B×tree_nodes | Few distributional assumptions
Neural Network | O(epochs×n×weights) | O(weights) | O(weights) | Σ(nₗ×nₗ₋₁) | Universal approximator (needs ample data)
// Variable Definitions
n, m = number of samples | d = number of features | k = number of clusters or gradient-descent iterations
i = iterations until convergence | B = number of trees | sv = number of support vectors
epochs = training passes over the data | batch = batch size | weights = total parameters
DIMENSIONALITY REDUCTION
Principal Component Analysis (PCA)
$$\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \text{Var}(\mathbf{X}\mathbf{w}) = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^T\boldsymbol{\Sigma}\mathbf{w}$$
w = principal component direction
Σ = covariance matrix
Find eigenvectors of Σ
Eigendecomposition
$$\boldsymbol{\Sigma}\mathbf{w} = \lambda\mathbf{w} \quad \text{(eigenvalue equation)}$$
Covariance matrix
$$\boldsymbol{\Sigma} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X} \quad \text{where } \mathbf{X} \text{ is centered}$$
SVD decomposition
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad \rightarrow \quad \text{PCs are the columns of } \mathbf{V} \quad (\text{here } \boldsymbol{\Sigma} \text{ is the diagonal matrix of singular values, not the covariance})$$
$$\mathbf{Y} = \mathbf{X}\mathbf{W} \quad \text{(project to k dimensions)}$$
O(min(n²d, nd²)) O(kd) transform
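PCA computed both ways described above: eigendecomposition of the covariance matrix and SVD of the centered data. The synthetic data is arbitrary; the two routes agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # correlated features
Xc = X - X.mean(axis=0)                                    # center the data

# Route 1: eigenvectors of the covariance matrix
C = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                          # sort by decreasing variance
W_eig = eigvecs[:, order]

# Route 2: SVD of centered X (principal components are the columns of V)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt.T

print(np.allclose(np.abs(W_eig), np.abs(W_svd)))           # equal up to sign
Y = Xc @ W_svd[:, :2]                                      # project to k = 2 dimensions
```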
t-SNE
$$p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$$
p_{j|i} = conditional probability in high-D
σᵢ = bandwidth (perplexity-related)
Preserve local neighborhood structure
Symmetric SNE
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
$$q_{ij} = \frac{(1 + \|y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k-y_l\|^2)^{-1}}$$
KL divergence cost
$$C = \sum_{i} \text{KL}(P_i\|Q_i) = \sum_{i,j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$$
$$\frac{\partial C}{\partial y_i} = 4\sum_{j} (p_{ij}-q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}$$
O(n²) per iteration O(n²) space
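A sketch of the high-dimensional affinities p(j|i) and the symmetrized p(ij) for a single fixed bandwidth. Real t-SNE tunes each σᵢ by binary search to hit a target perplexity and then minimizes the KL cost by gradient descent; both of those steps are omitted here, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 10))
sigma = 1.0                                              # fixed bandwidth (illustrative)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
P_cond = np.exp(-d2 / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)                            # p_{i|i} = 0 by convention
P_cond /= P_cond.sum(axis=1, keepdims=True)              # conditional p_{j|i}

P = (P_cond + P_cond.T) / (2 * len(X))                   # symmetric p_{ij}
print(P.sum())                                           # sums to 1
```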
ENSEMBLE METHODS
AdaBoost
$$H(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right) \quad \text{where } \alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$$
αₜ = weight of weak learner t
εₜ = weighted error rate
h_t = weak learner (e.g., decision stump)
Adaptive Boosting Algorithm
Initialize weights: $w_1^{(i)} = 1/n$
For t = 1 to T:
$$\varepsilon_t = \sum_{i} w_t^{(i)} \mathbb{I}[y_i \neq h_t(x_i)]$$
$$w_{t+1}^{(i)} = \frac{w_t^{(i)} \exp(\alpha_t \mathbb{I}[y_i \neq h_t(x_i)])}{Z_t}$$
Exponential loss minimization
$$L = \sum_{i} \exp\left(-y_i \sum_{t} \alpha_t h_t(x_i)\right)$$
O(T×weak_learner) O(T) prediction
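AdaBoost with decision stumps, following the weight-update recursion above (the weak learner is scikit-learn's DecisionTreeClassifier with max_depth=1, and labels are in {−1,+1}). The data, number of rounds, and the small ε guard against division by zero are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)                 # diagonal boundary

T = 50
w = np.full(len(X), 1 / len(X))                            # initialize w_1^(i) = 1/n
learners, alphas = [], []
for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = w[pred != y].sum()                               # weighted error epsilon_t
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))        # learner weight alpha_t
    w *= np.exp(alpha * (pred != y))                       # upweight the mistakes
    w /= w.sum()                                           # normalize (Z_t)
    learners.append(stump)
    alphas.append(alpha)

F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("training accuracy:", (np.sign(F) == y).mean())
```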
Gradient Boosting
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \gamma_m h_m(\mathbf{x}) \quad \text{where } h_m = \arg\min_h \sum_{i} (r_{im} - h(\mathbf{x}_i))^2$$
rᵢₘ = negative gradient (residual)
γₘ = step size
hₘ = weak learner fitted to residuals
Functional Gradient Descent
Initialize $F_0(\mathbf{x}) = \arg\min_c \sum_{i} L(y_i, c)$
For m = 1 to M:
$$r_{im} = -\left.\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right|_{F=F_{m-1}}$$
Fit $h_m(\mathbf{x})$ to residuals $\{r_{im}\}$
$$\gamma_m = \arg\min_{\gamma} \sum_{i} L(y_i, F_{m-1}(\mathbf{x}_i) + \gamma h_m(\mathbf{x}_i))$$
Common losses: MSE, MAE, log-loss
O(M×tree_training) O(M×tree_depth)
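Gradient boosting for squared-error loss, where the negative gradient is simply the residual y − F(x). Shallow regression trees act as weak learners, and a constant shrinkage factor stands in for the line search above; the data, M, and the learning rate are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(11)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

M, nu = 100, 0.1
F = np.full_like(y, y.mean())                      # F_0 = argmin_c sum L(y_i, c) for MSE
trees = []
for _ in range(M):
    r = y - F                                      # negative gradient = residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += nu * h.predict(X)                         # F_m = F_{m-1} + nu * h_m
    trees.append(h)

print("training MSE:", np.mean((y - F) ** 2))
```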
BAYESIAN METHODS
Naive Bayes
$$P(y=c|\mathbf{x}) = \frac{P(c) \prod_{i} P(x_i|c)}{P(\mathbf{x})} \quad \text{(independence assumption)}$$
P(c) = prior probability of class c
P(xᵢ|c) = likelihood of feature i given class c
Strong independence assumption
Different Distributions
Gaussian NB (continuous features)
$$P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma^2_{ic}}} \exp\left(-\frac{(x_i-\mu_{ic})^2}{2\sigma^2_{ic}}\right)$$
Multinomial NB (discrete counts)
$$P(x_i|c) = \frac{N_{ic} + \alpha}{N_c + \alpha |V|} \quad (\alpha = \text{smoothing parameter}, \; |V| = \text{vocabulary size})$$
Bernoulli NB (binary features)
$$P(x_i|c) = P_{ic}^{x_i}(1-P_{ic})^{(1-x_i)}$$
O(nd) training O(cd) prediction
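Gaussian Naive Bayes written out from the formulas above: per-class priors, per-feature means and variances, and a log-space sum of likelihoods. The data is synthetic, and the small variance floor is an illustrative guard against division by zero.

```python
import numpy as np

rng = np.random.default_rng(12)
X = np.r_[rng.normal(0, 1, size=(100, 3)), rng.normal(2, 1, size=(100, 3))]
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]

classes = np.unique(y)
prior = np.array([(y == c).mean() for c in classes])           # P(c)
mu = np.array([X[y == c].mean(axis=0) for c in classes])       # per-class feature means
var = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])

def log_posterior(x):
    # log P(c) + sum_i log P(x_i | c), dropping the constant log P(x)
    ll = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
    return np.log(prior) + ll.sum(axis=1)

pred = np.array([log_posterior(x).argmax() for x in X])
print("training accuracy:", (classes[pred] == y).mean())
```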
Gaussian Process
$$f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')) \quad \text{where } m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \quad k(\mathbf{x},\mathbf{x}') = \text{Cov}[f(\mathbf{x}),f(\mathbf{x}')]$$
m(x) = mean function
k(x,x') = covariance/kernel function
Bayesian non-parametric model
GP Regression (Prediction)
Training data: $\mathcal{D} = \{(\mathbf{x}_i,y_i)\}$
$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I}) \quad \text{where } K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$$
Predictive distribution
$$p(f_*|\mathbf{x}_*,\mathcal{D}) = \mathcal{N}(\mu_*, \sigma_*^2)$$
$$\mu_* = \mathbf{k}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$$
$$\sigma_*^2 = k_{**} - \mathbf{k}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{k}_*$$
O(n³) training O(n) prediction
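GP regression posterior mean and variance computed exactly as in the two equations above, with an RBF kernel on 1-D inputs. The kernel length-scale and noise level are fixed by hand here; in practice they are fitted by maximizing the marginal likelihood.

```python
import numpy as np

def rbf(A, B, length=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length ** 2)

rng = np.random.default_rng(13)
X = np.sort(rng.uniform(-4, 4, size=20))
y = np.sin(X) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-5, 5, 100)
noise = 0.1 ** 2

K = rbf(X, X) + noise * np.eye(len(X))                        # K + sigma^2 I
K_star = rbf(X_star, X)                                       # rows are k_*^T
K_ss = rbf(X_star, X_star)                                    # k_**

mu_star = K_star @ np.linalg.solve(K, y)                      # posterior mean
cov_star = K_ss - K_star @ np.linalg.solve(K, K_star.T)       # posterior covariance
std_star = np.sqrt(np.diag(cov_star))
print(mu_star[:3], std_star[:3])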
REINFORCEMENT LEARNING
Q-Learning
$$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$
α = learning rate
γ = discount factor
r = immediate reward
Model-free TD learning
Bellman Equation
Optimal action-value function
$$Q^*(s,a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s',a')|s,a]$$
Temporal Difference error
$$\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)$$
ε-greedy policy
$$\pi(a|s) = \begin{cases} 1-\varepsilon+\varepsilon/|A| & \text{if } a=\arg\max_{a'} Q(s,a') \\ \varepsilon/|A| & \text{otherwise} \end{cases}$$
O(|S||A|) space O(1) update
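Tabular Q-learning with an ε-greedy policy on a made-up five-state corridor (not a standard benchmark): actions are left/right and the only reward is +1 for reaching the rightmost state. Learning rate, discount, ε, and episode counts are arbitrary.

```python
import numpy as np

n_states, n_actions = 5, 2                         # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.1, 0.9, 0.2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(14)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for _ in range(1000):                              # episodes
    s = int(rng.integers(n_states - 1))            # random non-terminal start state
    for _ in range(50):                            # cap episode length
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
        s = s_next
        if done:
            break

print(np.round(Q, 2))   # "right" should dominate in every state after training
```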
Policy Gradient (REINFORCE)
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) Q^\pi(s,a)]$$
π_θ = parameterized policy
Q^π(s,a) = action-value under policy π
Direct policy optimization
Policy Gradient Theorem
Objective: maximize expected return
$$J(\theta) = \mathbb{E}_\pi[G_t] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$$
REINFORCE algorithm
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$
With baseline (variance reduction)
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)(G_t - b(s_t))$$
O(episode_length) O(parameters)
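REINFORCE with a softmax policy over two actions on a trivial one-step bandit-style task invented for illustration (action 1 pays +1, action 0 pays 0), so the policy should shift its mass toward action 1. No baseline is used, and the learning rate and episode count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(15)
theta = np.zeros(2)                       # one logit per action (single state)
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)               # sample action from the current policy
    G = 1.0 if a == 1 else 0.0            # return of this one-step episode
    # grad of log pi(a) for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * G      # REINFORCE update
print(softmax(theta))                     # most probability mass on action 1
```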
NOTATION REFERENCE
COMMON SYMBOLS
x, X = input features / feature matrix
y, Y = target labels / label matrix
θ, β, w = parameters/weights
α = learning rate
λ = regularization parameter
σ, Σ = standard deviation, covariance matrix
DIMENSIONS
n = number of samples
d, p = number of features
k = number of clusters/components
m = number of samples / mini-batch size
L = number of layers
T, B = number of trees/iterations
OPERATIONS
⊙ = element-wise multiplication
∇ = gradient operator
||·|| = norm (usually L2)
argmax/argmin = argument maximizing/minimizing
𝔼[·] = expectation
∝ = proportional to

Mathematical ML Reference Guide • Formulas, derivations, and complexity analysis

Includes 20+ core algorithms with mathematical foundations • ESL/PRML style notation

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin