🚀 Optimization Algorithms

📉 Gradient Descent

Iterative method to minimize error by adjusting model parameters. Moves in the direction that reduces the error the most (downhill).

Steady descent path - consistent but can be slow

✓ Simple and reliable
△ Can be slow on flat surfaces
⚠ May get stuck in local minima

⚡ Momentum

Adds inertia to Gradient Descent for faster convergence. Helps avoid getting stuck in small dips by carrying momentum from previous steps.

Accelerated path - builds speed and can roll through shallow dips

✓ Faster convergence
✓ Escapes local minima
⚠ Can overshoot optimum

📊 RMSprop

Adjusts the learning rate based on the steepness of the error surface. Takes smaller steps on steep slopes and larger steps on shallow slopes.

Adaptive path - adjusts step size based on terrain

✓ Adaptive learning rate
✓ Handles different scales well
△ Learning rate can decay too fast

🎯 Adam

Combines Momentum and RMSprop for efficient and stable learning. Adjusts step sizes and remembers past movements for smarter updates.

Optimal path - combines speed and adaptivity

✓ Best of both worlds
✓ Robust and efficient
✓ Most popular choice
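
In practice these optimizers are rarely hand-coded; deep-learning frameworks provide them directly. A minimal sketch, assuming PyTorch and a stand-in parameter list (a real training script would pass model.parameters()):

```python
import torch

# Stand-in parameter for illustration; real code would use model.parameters()
params = [torch.nn.Parameter(torch.randn(10))]

# Plain gradient descent (SGD without momentum)
opt_gd = torch.optim.SGD(params, lr=0.01)

# Gradient descent with momentum
opt_momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)

# RMSprop: per-parameter adaptive step sizes
opt_rmsprop = torch.optim.RMSprop(params, lr=0.001, alpha=0.9)

# Adam: momentum plus adaptive step sizes (a common default choice)
opt_adam = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))
```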

📈 Performance Comparison

How each algorithm performs across different criteria

[Interactive star-rating chart: Gradient Descent, Momentum, RMSprop, and Adam are each rated on Speed and Stability; the ratings did not survive this text export.]

🔢 Mathematical Formulations

Gradient Descent

θ = θ - α∇J(θ)

Simple parameter update with learning rate α
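
A minimal NumPy sketch of this update rule on an illustrative toy objective J(θ) = ‖θ‖² with gradient 2θ (the objective and all values below are assumptions for demonstration only):

```python
import numpy as np

def gradient_descent_step(theta, grad, alpha=0.1):
    """One update: theta <- theta - alpha * grad(theta)."""
    return theta - alpha * grad(theta)

def grad(theta):
    """Gradient of the toy objective J(theta) = ||theta||^2 (illustrative)."""
    return 2 * theta

theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, grad, alpha=0.1)
print(theta)  # approaches [0, 0], the minimum of the toy objective
```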

Momentum

v = βv + α∇J(θ)
θ = θ - v

Adds velocity term with momentum β
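
A NumPy sketch of the velocity update above on the same illustrative toy objective (β = 0.9 and α = 0.05 are assumed values, not prescribed by the slide):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.05, beta=0.9):
    """v <- beta*v + alpha*grad(theta); theta <- theta - v."""
    v = beta * v + alpha * grad(theta)
    return theta - v, v

def grad(theta):
    """Gradient of the toy objective J(theta) = ||theta||^2 (illustrative)."""
    return 2 * theta

theta, v = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(300):
    theta, v = momentum_step(theta, v, grad)
print(theta)  # oscillates toward [0, 0], carried by the accumulated velocity
```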

RMSprop

E[g²] = βE[g²] + (1-β)g²
θ = θ - α·g/√(E[g²] + ε)

Adapts learning rate based on gradient magnitude
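
A NumPy sketch of the running average E[g²] and the scaled step above (β = 0.9, α = 0.01, and the toy gradient are assumed for illustration):

```python
import numpy as np

def rmsprop_step(theta, eg2, grad, alpha=0.01, beta=0.9, eps=1e-8):
    """E[g2] <- beta*E[g2] + (1-beta)*g^2; theta <- theta - alpha*g/sqrt(E[g2] + eps)."""
    g = grad(theta)
    eg2 = beta * eg2 + (1 - beta) * g**2              # running average of squared gradients
    return theta - alpha * g / np.sqrt(eg2 + eps), eg2

def grad(theta):
    """Gradient of the toy objective J(theta) = ||theta||^2 (illustrative)."""
    return 2 * theta

theta, eg2 = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(500):
    theta, eg2 = rmsprop_step(theta, eg2, grad)
print(theta)  # hovers near [0, 0]; each coordinate moves by roughly alpha per step
```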

Adam

m = β₁m + (1-β₁)g
v = β₂v + (1-β₂)g²
θ = θ - α·m̂/(√v̂ + ε)

Combines momentum and adaptive learning rates; m̂ and v̂ are bias-corrected estimates of m and v (divided by 1-β₁ᵗ and 1-β₂ᵗ)
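
A NumPy sketch combining both moment estimates with the bias correction, again on the illustrative toy objective (β₁ = 0.9 and β₂ = 0.999 follow common convention; α = 0.01 is assumed):

```python
import numpy as np

def adam_step(theta, m, v, g, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moments m_hat, v_hat."""
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * g**2        # second moment (RMSprop term)
    m_hat = m / (1 - beta1**t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                      # t starts at 1 for the bias correction
    g = 2 * theta                             # gradient of toy J(theta) = ||theta||^2
    theta, m, v = adam_step(theta, m, v, g, t)
print(theta)  # ends close to [0, 0] (within roughly the step size alpha)
```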

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin
