MACHINE LEARNING ALGORITHMS

Mathematical Foundations & Implementation Details

LINEAR MODELS
Linear Regression
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \mathbf{x}^T \boldsymbol{\beta}$$
ŷ = predicted value
β₀ = intercept (bias term)
βᵢ = coefficient for feature i
x = feature vector (augmented with a leading 1 for the intercept)
X = feature matrix [n×p]
Cost Function (MSE)
$$J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$$
Minimize using Normal Equation
$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Or Gradient Descent
$$\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \frac{1}{m} \mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})$$
O(nd²+d³) or O(nkd) training O(d) prediction
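A minimal NumPy sketch of the two fitting routes above, the normal equation and batch gradient descent on the MSE cost. The data, seed, learning rate, and iteration count are illustrative choices only; in practice XᵀX can be ill-conditioned, so a pseudo-inverse or ridge penalty is often preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.c_[np.ones(n), rng.normal(size=(n, d))]   # prepend intercept column
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equation: beta = (X^T X)^{-1} X^T y (solved, not explicitly inverted)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: beta := beta - alpha * (1/m) X^T (X beta - y)
beta_gd = np.zeros(X.shape[1])
alpha = 0.1
for _ in range(2000):
    beta_gd -= alpha / n * X.T @ (X @ beta_gd - y)

print(np.round(beta_ne, 2), np.round(beta_gd, 2))  # both approach beta_true
```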
Logistic Regression
$$P(y=1|\mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}} \quad \text{where } z = \boldsymbol{\beta}^T\mathbf{x}$$
σ(z) = sigmoid function
z = linear combination βᵀx
Output: probability ∈ (0,1)
Negative Log-Likelihood (Cross-Entropy) Cost
$$J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)}))]$$
Gradient (no closed form solution)
$$\nabla J = \frac{1}{m} \mathbf{X}^T(\sigma(\mathbf{X}\boldsymbol{\beta}) - \mathbf{y})$$
Update rule
$$\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \nabla J$$
O(nkd) training O(d) prediction
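A short sketch of logistic regression trained by gradient descent on the cross-entropy cost above. The synthetic data, learning rate, and iteration count are arbitrary, and no regularization is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d = 200, 2
X = np.c_[np.ones(n), rng.normal(size=(n, d))]        # intercept column + features
beta_true = np.array([-0.5, 2.0, -3.0])
y = (sigmoid(X @ beta_true) > rng.uniform(size=n)).astype(float)

beta = np.zeros(d + 1)
alpha = 0.5
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ beta) - y) / n           # gradient of J
    beta -= alpha * grad                               # update rule

p = sigmoid(X @ beta)                                  # predicted P(y=1|x)
print(np.round(beta, 2))
```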
TREE-BASED MODELS
Decision Tree
$$\text{Information Gain} = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$
H(S) = entropy of set S
Sᵥ = subset of S with value v
Entropy & Gini Impurity
$$\text{Entropy: } H(S) = -\sum_{i} p_i \log_2(p_i)$$
$$\text{Gini: } G(S) = 1 - \sum_{i} p_i^2$$
Split criterion
$$\text{Best split} = \arg\max(\text{Information Gain})$$
O(dn log n) training O(log n) prediction
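A sketch of the split criterion only: entropy, Gini impurity, and information gain for one candidate binary split. A full tree applies this recursively over all candidate splits; the labels and split point here are made-up toy values.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # H(S) - sum_v |S_v|/|S| * H(S_v) for a two-way split
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

S = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1])
left, right = S[:6], S[6:]          # one candidate split of S
print(entropy(S), gini(S), information_gain(S, left, right))
```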
Random Forest
$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}) \quad \text{(regression)} \quad \text{or} \quad \text{mode}\{T_1(\mathbf{x}),\ldots,T_B(\mathbf{x})\} \quad \text{(classification)}$$
B = number of trees
Tᵦ = b-th tree trained on bootstrap sample
Bootstrap Sampling
Sample with replacement: |bootstrap| = |original|
Feature bagging (mtry parameter)
At each split: randomly select √p features (classification)
At each split: randomly select p/3 features (regression)
O(B×dn log n) training O(B×log n) prediction
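A compact bagging sketch using shallow scikit-learn trees as base learners. It simplifies the mtry rule above by subsampling √p features once per tree rather than at every split; forest size, tree depth, and data are arbitrary illustration choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

B, mtry = 25, int(np.sqrt(X.shape[1]))             # sqrt(p) features per tree
trees, feats = [], []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample (with replacement)
    cols = rng.choice(X.shape[1], size=mtry, replace=False)
    t = DecisionTreeClassifier(max_depth=3).fit(X[idx][:, cols], y[idx])
    trees.append(t)
    feats.append(cols)

# Majority vote across the B trees
votes = np.stack([t.predict(X[:, c]) for t, c in zip(trees, feats)])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```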
SUPPORT VECTOR MACHINES
SVM (Hard Margin)
$$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to: } y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \quad \forall i$$
w = weight vector (normal to hyperplane)
b = bias term
Margin = 2/||w||
Lagrangian Dual
$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i} \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1]$$
KKT conditions lead to
$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \quad \text{(support vectors only)}$$
$$\max_{\boldsymbol{\alpha}} \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to: } \alpha_i \geq 0, \; \sum_{i} \alpha_i y_i = 0$$
O(n²) to O(n³) training O(sv×d) prediction
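A sketch verifying the KKT identity w = Σᵢ αᵢyᵢxᵢ on a linearly separable toy problem. It uses scikit-learn's SVC (a very large C approximates the hard margin); in SVC, dual_coef_ holds αᵢyᵢ for the support vectors. The data and C value are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.r_[rng.normal(size=(20, 2)) + 3, rng.normal(size=(20, 2)) - 3]
y = np.r_[np.ones(20), -np.ones(20)]

svm = SVC(kernel="linear", C=1e6).fit(X, y)              # C -> inf ~ hard margin
w_from_dual = svm.dual_coef_ @ svm.support_vectors_      # sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, svm.coef_))               # matches the primal weights
print("margin width:", 2 / np.linalg.norm(svm.coef_))    # margin = 2 / ||w||
```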
SVM (Soft Margin + Kernel)
$$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i \quad \text{subject to: } y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i$$
C = regularization parameter
ξᵢ = slack variables
φ(x) = feature mapping to higher dimension
Kernel Trick
$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$
Common kernels:
$$\text{RBF: } K(\mathbf{x},\mathbf{x}') = \exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$$
$$\text{Polynomial: } K(\mathbf{x},\mathbf{x}') = (\gamma\mathbf{x}^T\mathbf{x}' + r)^d$$
$$\text{Decision: } f(\mathbf{x}) = \text{sign}\left(\sum_{i} \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b\right)$$
O(n²) to O(n³) training O(sv×d) prediction
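A sketch of the kernel trick: the decision value of an RBF-kernel SVM recomputed by hand from its support vectors, matching f(x) = Σᵢ αᵢyᵢK(xᵢ,x) + b (scikit-learn's dual_coef_ stores αᵢyᵢ). The circular toy data, γ, and C are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int) * 2 - 1   # circular boundary

gamma, C = 0.5, 1.0
svm = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

def rbf(A, B, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, svm.support_vectors_, gamma)                      # K(x, x_i)
f_manual = K @ svm.dual_coef_.ravel() + svm.intercept_       # sum alpha_i y_i K + b
print(np.allclose(f_manual, svm.decision_function(X)))       # True
```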
CLUSTERING ALGORITHMS
K-Means
$$J = \sum_{i=1}^{n} \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_{c_i}\|^2 \quad \text{where } c_i \in \{1,\ldots,k\}$$
μcᵢ = centroid of cluster cᵢ
cᵢ = cluster assignment for point i
Lloyd's Algorithm
1. Initialize k centroids randomly
2. Assign points to nearest centroid
$$c^{(i)} := \arg\min_j \|\mathbf{x}^{(i)} - \boldsymbol{\mu}_j\|^2$$
3. Update centroids
$$\boldsymbol{\mu}_j := \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}^{(i)}$$
4. Repeat until convergence
O(nkdi) training O(kd) prediction
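A minimal NumPy implementation of Lloyd's algorithm as listed above, with plain random initialization (real implementations typically use k-means++ and several restarts). The two-blob data is synthetic, and the sketch assumes no cluster is left empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]        # 1. init centroids
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        c = d2.argmin(axis=1)                                # 2. assign to nearest centroid
        new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])  # 3. update centroids
        if np.allclose(new_mu, mu):                          # 4. stop at convergence
            break
        mu = new_mu
    return c, mu

rng = np.random.default_rng(5)
X = np.r_[rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5]
labels, centroids = kmeans(X, k=2)
print(centroids)
```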
Gaussian Mixture Model
$$p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$$
$$\text{where } \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$
πⱼ = mixing coefficient (prior)
μⱼ = mean of component j
Σⱼ = covariance matrix of component j
EM Algorithm
E-step: Compute responsibilities
$$\gamma(z_{ij}) = \frac{\pi_j \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_l,\boldsymbol{\Sigma}_l)}$$
M-step: Update parameters (with $N_j = \sum_{i=1}^{n} \gamma(z_{ij})$ and $\pi_j = N_j/n$)
$$\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma(z_{ij}) \mathbf{x}_i$$
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{i=1}^{n} \gamma(z_{ij}) (\mathbf{x}_i-\boldsymbol{\mu}_j)(\mathbf{x}_i-\boldsymbol{\mu}_j)^T$$
O(nkd²i) training O(kd²) prediction
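The E- and M-steps above written directly in NumPy, with SciPy supplying the Gaussian densities. Initialization, iteration count, and data are arbitrary; a practical implementation would add covariance regularization and a log-likelihood stopping rule.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
X = np.r_[rng.normal(size=(150, 2)), rng.normal(size=(150, 2)) + 4]
n, d, k = X.shape[0], X.shape[1], 2

pi = np.full(k, 1 / k)
mu = X[rng.choice(n, size=k, replace=False)]
Sigma = np.stack([np.eye(d)] * k)

for _ in range(50):
    # E-step: responsibilities gamma[i, j]
    dens = np.stack([multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: update pi, mu, Sigma
    N = gamma.sum(axis=0)
    pi = N / n
    mu = (gamma.T @ X) / N[:, None]
    for j in range(k):
        Xc = X - mu[j]
        Sigma[j] = (gamma[:, j, None] * Xc).T @ Xc / N[j]

print(np.round(mu, 2))
```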
NEURAL NETWORKS
Multi-Layer Perceptron
$$\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) = \sigma(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$
a⁽ˡ⁾ = activations at layer l
W⁽ˡ⁾ = weight matrix [nₗ × nₗ₋₁]
σ = activation function (ReLU, sigmoid, tanh)
Backpropagation
Forward pass: compute activations
Backward pass: compute gradients
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} C \odot \sigma'(\mathbf{z}^{(L)})$$
$$\boldsymbol{\delta}^{(l)} = ((\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)}) \odot \sigma'(\mathbf{z}^{(l)})$$
$$\frac{\partial C}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T, \qquad \frac{\partial C}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
O(epochs×n×weights) training O(weights) prediction
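A two-layer MLP trained on XOR using exactly the backprop equations above, with sigmoid activations and a squared-error cost. Hidden size, learning rate, and iteration count are illustrative; real networks rely on autodiff frameworks instead of hand-written gradients.

```python
import numpy as np

def sigma(z):  return 1 / (1 + np.exp(-z))
def dsigma(z): return sigma(z) * (1 - sigma(z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T   # shape (2, 4)
Y = np.array([[0, 1, 1, 0]], dtype=float)                       # shape (1, 4)

rng = np.random.default_rng(7)
W1, b1 = rng.normal(size=(8, 2)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)), np.zeros((1, 1))
alpha, m = 1.0, X.shape[1]

for _ in range(10000):
    # forward pass
    z1 = W1 @ X + b1;  a1 = sigma(z1)
    z2 = W2 @ a1 + b2; a2 = sigma(z2)
    # backward pass: delta^(L), then delta^(l)
    d2 = (a2 - Y) * dsigma(z2)
    d1 = (W2.T @ d2) * dsigma(z1)
    # gradient step: dC/dW = delta a^(l-1)^T, dC/db = delta
    W2 -= alpha * d2 @ a1.T / m;  b2 -= alpha * d2.mean(axis=1, keepdims=True)
    W1 -= alpha * d1 @ X.T / m;   b1 -= alpha * d1.mean(axis=1, keepdims=True)

# network outputs for the four XOR inputs (should approach [0, 1, 1, 0])
print(np.round(sigma(W2 @ sigma(W1 @ X + b1) + b2), 2))
```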
Convolutional Neural Network
$$(f * g)[m,n] = \sum_{i} \sum_{j} f[i,j] \cdot g[m-i, n-j]$$
f = input feature map
g = kernel/filter
Output size = (W-F+2P)/S + 1 (W = input width, F = filter size, P = padding, S = stride)
Convolution Layer
Multiple filters create feature maps
$$Y[i,j,d] = \sum_{u} \sum_{v} \sum_{c} W[u,v,c,d] \cdot X[i+u,j+v,c] + b[d]$$
Pooling reduces spatial dimensions
$$\text{Max pooling: } Y[i,j] = \max(X[si:si+f, sj:sj+f])$$
Parameter sharing: same filter across image
O(batch×H×W×filters×kernel²) O(parameters)
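A direct NumPy sketch of a single-channel, single-filter "valid" convolution layer, plus the output-size formula. Note that deep-learning libraries implement convolution as cross-correlation (no kernel flip), which is what this loop does; the toy input and filter are arbitrary.

```python
import numpy as np

def conv2d_valid(X, W):
    H, Wd = X.shape
    F = W.shape[0]
    out = np.zeros((H - F + 1, Wd - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + F, j:j + F] * W)   # slide filter over the input
    return out

X = np.arange(36, dtype=float).reshape(6, 6)
W = np.array([[1.0, 0.0], [0.0, -1.0]])                   # 2x2 edge-like filter
Y = conv2d_valid(X, W)

# Output size = (W - F + 2P)/S + 1 with W=6, F=2, P=0, S=1  ->  5
print(Y.shape)   # (5, 5)
```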
ALGORITHM COMPLEXITY COMPARISON
Algorithm | Training Time | Training Space | Prediction Time | Parameters | Assumptions
Linear Regression | O(nd²+d³) or O(nkd) | O(nd) | O(d) | d+1 | Linear relationship
Logistic Regression | O(nkd) | O(nd) | O(d) | d+1 | Linear decision boundary
SVM | O(n²) to O(n³) | O(n²) | O(sv×d) | ≤n (support vectors) | Large margin separation
K-Means | O(nkdi) | O(nd+kd) | O(kd) | kd | Spherical clusters
Random Forest | O(B×n log n×d) | O(B×n) | O(B×log n) | B×tree_nodes | Few distributional assumptions
Neural Network | O(epochs×n×weights) | O(weights) | O(weights) | Σ(nₗ×nₗ₋₁) | Universal approximator (needs ample data)
// Variable Definitions
n, m = number of samples | d = number of features | k = number of clusters or gradient-descent iterations
i = iterations until convergence | B = number of trees | sv = number of support vectors
epochs = training passes over the data | batch = batch size | weights = total parameters
DIMENSIONALITY REDUCTION
Principal Component Analysis (PCA)
$$\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \text{Var}(\mathbf{X}\mathbf{w}) = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^T\boldsymbol{\Sigma}\mathbf{w}$$
w = principal component direction
Σ = covariance matrix
Find eigenvectors of Σ
Eigendecomposition
$$\boldsymbol{\Sigma}\mathbf{w} = \lambda\mathbf{w} \quad \text{(eigenvalue equation)}$$
Covariance matrix
$$\boldsymbol{\Sigma} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X} \quad \text{where } \mathbf{X} \text{ is centered}$$
SVD decomposition
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad \rightarrow \quad \text{PCs are the columns of } \mathbf{V} \quad (\text{here } \boldsymbol{\Sigma} \text{ is the diagonal matrix of singular values, not the covariance})$$
$$\mathbf{Y} = \mathbf{X}\mathbf{W} \quad \text{(project to k dimensions)}$$
O(min(n²d, nd²)) O(kd) transform
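PCA computed both ways described above: eigendecomposition of the covariance matrix and SVD of the centered data. The synthetic data is arbitrary; the two routes agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # correlated features
Xc = X - X.mean(axis=0)                                    # center the data

# Route 1: eigenvectors of the covariance matrix
C = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                          # sort by decreasing variance
W_eig = eigvecs[:, order]

# Route 2: SVD of centered X (principal components are the columns of V)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt.T

print(np.allclose(np.abs(W_eig), np.abs(W_svd)))           # equal up to sign
Y = Xc @ W_svd[:, :2]                                      # project to k = 2 dimensions
```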
t-SNE
$$p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$$
p_{j|i} = conditional probability in high-D
σᵢ = bandwidth (perplexity-related)
Preserve local neighborhood structure
Symmetric SNE
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
$$q_{ij} = \frac{(1 + \|y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k-y_l\|^2)^{-1}}$$
KL divergence cost
$$C = \sum_{i} \text{KL}(P_i\|Q_i) = \sum_{i,j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$$
$$\frac{\partial C}{\partial y_i} = 4\sum_{j} (p_{ij}-q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}$$
O(n²) per iteration O(n²) space
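A sketch of the high-dimensional affinities p(j|i) and the symmetrized p(ij) for a single fixed bandwidth. Real t-SNE tunes each σᵢ by binary search to hit a target perplexity and then minimizes the KL cost by gradient descent; both of those steps are omitted here, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 10))
sigma = 1.0                                              # fixed bandwidth (illustrative)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
P_cond = np.exp(-d2 / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)                            # p_{i|i} = 0 by convention
P_cond /= P_cond.sum(axis=1, keepdims=True)              # conditional p_{j|i}

P = (P_cond + P_cond.T) / (2 * len(X))                   # symmetric p_{ij}
print(P.sum())                                           # sums to 1
```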
ENSEMBLE METHODS
AdaBoost
$$H(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right) \quad \text{where } \alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$$
αₜ = weight of weak learner t
εₜ = weighted error rate
h_t = weak learner (e.g., decision stump)
Adaptive Boosting Algorithm
Initialize weights: $w_1^{(i)} = 1/n$
For t = 1 to T:
$$\varepsilon_t = \sum_{i} w_t^{(i)} \mathbb{I}[y_i \neq h_t(x_i)]$$
$$w_{t+1}^{(i)} = \frac{w_t^{(i)} \exp(\alpha_t \mathbb{I}[y_i \neq h_t(x_i)])}{Z_t}$$
Exponential loss minimization
$$L = \sum_{i} \exp\left(-y_i \sum_{t} \alpha_t h_t(x_i)\right)$$
O(T×weak_learner) O(T) prediction
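AdaBoost with decision stumps, following the weight-update recursion above (the weak learner is scikit-learn's DecisionTreeClassifier with max_depth=1, and labels are in {−1,+1}). The data, number of rounds, and the small ε guard against division by zero are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)                 # diagonal boundary

T = 50
w = np.full(len(X), 1 / len(X))                            # initialize w_1^(i) = 1/n
learners, alphas = [], []
for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = w[pred != y].sum()                               # weighted error epsilon_t
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))        # learner weight alpha_t
    w *= np.exp(alpha * (pred != y))                       # upweight the mistakes
    w /= w.sum()                                           # normalize (Z_t)
    learners.append(stump)
    alphas.append(alpha)

F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("training accuracy:", (np.sign(F) == y).mean())
```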
Gradient Boosting
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \gamma_m h_m(\mathbf{x}) \quad \text{where } h_m = \arg\min_h \sum_{i} (r_{im} - h(\mathbf{x}_i))^2$$
rᵢₘ = negative gradient (residual)
γₘ = step size
hₘ = weak learner fitted to residuals
Functional Gradient Descent
Initialize $F_0(\mathbf{x}) = \arg\min_c \sum_{i} L(y_i, c)$
For m = 1 to M:
$$r_{im} = -\left.\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right|_{F=F_{m-1}}$$
Fit $h_m(\mathbf{x})$ to residuals $\{r_{im}\}$
$$\gamma_m = \arg\min_{\gamma} \sum_{i} L(y_i, F_{m-1}(\mathbf{x}_i) + \gamma h_m(\mathbf{x}_i))$$
Common losses: MSE, MAE, log-loss
O(M×tree_training) O(M×tree_depth)
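Gradient boosting for squared-error loss, where the negative gradient is simply the residual y − F(x). Shallow regression trees act as weak learners, and a constant shrinkage factor stands in for the line search above; the data, M, and the learning rate are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(11)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

M, nu = 100, 0.1
F = np.full_like(y, y.mean())                      # F_0 = argmin_c sum L(y_i, c) for MSE
trees = []
for _ in range(M):
    r = y - F                                      # negative gradient = residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += nu * h.predict(X)                         # F_m = F_{m-1} + nu * h_m
    trees.append(h)

print("training MSE:", np.mean((y - F) ** 2))
```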
BAYESIAN METHODS
Naive Bayes
$$P(y=c|\mathbf{x}) = \frac{P(c) \prod_{i} P(x_i|c)}{P(\mathbf{x})} \quad \text{(independence assumption)}$$
P(c) = prior probability of class c
P(xᵢ|c) = likelihood of feature i given class c
Strong independence assumption
Different Distributions
Gaussian NB (continuous features)
$$P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma^2_{ic}}} \exp\left(-\frac{(x_i-\mu_{ic})^2}{2\sigma^2_{ic}}\right)$$
Multinomial NB (discrete counts)
$$P(x_i|c) = \frac{N_{ic} + \alpha}{N_c + \alpha |V|} \quad (\alpha = \text{smoothing parameter}, \; |V| = \text{vocabulary size})$$
Bernoulli NB (binary features)
$$P(x_i|c) = P_{ic}^{x_i}(1-P_{ic})^{(1-x_i)}$$
O(nd) training O(cd) prediction
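Gaussian Naive Bayes written out from the formulas above: per-class priors, per-feature means and variances, and a log-space sum of likelihoods. The data is synthetic, and the small variance floor is an illustrative guard against division by zero.

```python
import numpy as np

rng = np.random.default_rng(12)
X = np.r_[rng.normal(0, 1, size=(100, 3)), rng.normal(2, 1, size=(100, 3))]
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]

classes = np.unique(y)
prior = np.array([(y == c).mean() for c in classes])           # P(c)
mu = np.array([X[y == c].mean(axis=0) for c in classes])       # per-class feature means
var = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])

def log_posterior(x):
    # log P(c) + sum_i log P(x_i | c), dropping the constant log P(x)
    ll = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
    return np.log(prior) + ll.sum(axis=1)

pred = np.array([log_posterior(x).argmax() for x in X])
print("training accuracy:", (classes[pred] == y).mean())
```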
Gaussian Process
$$f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')) \quad \text{where } m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \quad k(\mathbf{x},\mathbf{x}') = \text{Cov}[f(\mathbf{x}),f(\mathbf{x}')]$$
m(x) = mean function
k(x,x') = covariance/kernel function
Bayesian non-parametric model
GP Regression (Prediction)
Training data: $\mathcal{D} = \{(\mathbf{x}_i,y_i)\}$
$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I}) \quad \text{where } K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$$
Predictive distribution
$$p(f_*|\mathbf{x}_*,\mathcal{D}) = \mathcal{N}(\mu_*, \sigma_*^2)$$
$$\mu_* = \mathbf{k}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$$
$$\sigma_*^2 = k_{**} - \mathbf{k}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{k}_*$$
O(n³) training O(n) prediction
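GP regression posterior mean and variance computed exactly as in the two equations above, with an RBF kernel on 1-D inputs. The kernel length-scale and noise level are fixed by hand here; in practice they are fitted by maximizing the marginal likelihood.

```python
import numpy as np

def rbf(A, B, length=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length ** 2)

rng = np.random.default_rng(13)
X = np.sort(rng.uniform(-4, 4, size=20))
y = np.sin(X) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-5, 5, 100)
noise = 0.1 ** 2

K = rbf(X, X) + noise * np.eye(len(X))                        # K + sigma^2 I
K_star = rbf(X_star, X)                                       # rows are k_*^T
K_ss = rbf(X_star, X_star)                                    # k_**

mu_star = K_star @ np.linalg.solve(K, y)                      # posterior mean
cov_star = K_ss - K_star @ np.linalg.solve(K, K_star.T)       # posterior covariance
std_star = np.sqrt(np.diag(cov_star))
print(mu_star[:3], std_star[:3])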
REINFORCEMENT LEARNING
Q-Learning
$$Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$
α = learning rate
γ = discount factor
r = immediate reward
Model-free TD learning
Bellman Equation
Optimal action-value function
$$Q^*(s,a) = \mathbb{E}[r + \gamma \max_{a'} Q^*(s',a')|s,a]$$
Temporal Difference error
$$\delta_t = r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)$$
ε-greedy policy
$$\pi(a|s) = \begin{cases} 1-\varepsilon+\varepsilon/|A| & \text{if } a=\arg\max_{a'} Q(s,a') \\ \varepsilon/|A| & \text{otherwise} \end{cases}$$
O(|S||A|) space O(1) update
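Tabular Q-learning with an ε-greedy policy on a made-up five-state corridor (not a standard benchmark): actions are left/right and the only reward is +1 for reaching the rightmost state. Learning rate, discount, ε, and episode counts are arbitrary.

```python
import numpy as np

n_states, n_actions = 5, 2                         # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.1, 0.9, 0.2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(14)

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for _ in range(1000):                              # episodes
    s = int(rng.integers(n_states - 1))            # random non-terminal start state
    for _ in range(50):                            # cap episode length
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
        s = s_next
        if done:
            break

print(np.round(Q, 2))   # "right" should dominate in every state after training
```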
Policy Gradient (REINFORCE)
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) Q^\pi(s,a)]$$
π_θ = parameterized policy
Q^π(s,a) = action-value under policy π
Direct policy optimization
Policy Gradient Theorem
Objective: maximize expected return
$$J(\theta) = \mathbb{E}_\pi[G_t] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right]$$
REINFORCE algorithm
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$
With baseline (variance reduction)
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)(G_t - b(s_t))$$
O(episode_length) O(parameters)
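REINFORCE with a softmax policy over two actions on a trivial one-step bandit-style task invented for illustration (action 1 pays +1, action 0 pays 0), so the policy should shift its mass toward action 1. No baseline is used, and the learning rate and episode count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(15)
theta = np.zeros(2)                       # one logit per action (single state)
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)               # sample action from the current policy
    G = 1.0 if a == 1 else 0.0            # return of this one-step episode
    # grad of log pi(a) for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * G      # REINFORCE update
print(softmax(theta))                     # most probability mass on action 1
```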
NOTATION REFERENCE
COMMON SYMBOLS
x, X = input features / feature matrix
y, Y = target labels / label matrix
θ, β, w = parameters/weights
α = learning rate
λ = regularization parameter
σ, Σ = standard deviation, covariance matrix
DIMENSIONS
n = number of samples
d, p = number of features
k = number of clusters/components
m = number of samples / mini-batch size
L = number of layers
T, B = number of trees/iterations
OPERATIONS
⊙ = element-wise multiplication
∇ = gradient operator
||·|| = norm (usually L2)
argmax/argmin = argument maximizing/minimizing
𝔼[·] = expectation
∝ = proportional to

Mathematical ML Reference Guide • Formulas, derivations, and complexity analysis

Includes 20+ core algorithms with mathematical foundations • ESL/PRML style notation

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin