Principal Component Analysis (PCA)

A Complete Visual Guide

What is PCA?

Principal Component Analysis is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible. It finds the directions (principal components) along which the data varies the most.

Key Applications:

  • Dimensionality reduction
  • Data visualization
  • Noise reduction
  • Feature extraction
  • Data compression

What You'll Learn:

  • Mathematical foundations
  • Geometric intuition
  • Step-by-step algorithm
  • Common mistakes
  • Practical implementation

Geometric Intuition

Imagine you have data points scattered in space. PCA finds the "best" directions to view this data - directions that show the most variation. These directions are the principal components.

Interactive Demo: Rotating the Coordinate System

Rotating the coordinate system changes how the same data is represented. PCA finds the rotation whose new axes capture the most variance: the first axis carries the largest share, the second the largest share of what remains, and so on.

Mathematical Foundation: Covariance

PCA is fundamentally about understanding the covariance structure of your data. The covariance matrix captures how variables relate to each other.

Covariance between variables X and Y:
$$\text{Cov}(X,Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mu_X \mu_Y$$
Covariance Matrix (for data matrix X):
$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$$
where $\mathbf{X}$ is mean-centered data

Properties of Covariance Matrix:

  • Symmetric: C = C^T
  • Positive semi-definite
  • Diagonal elements = variances
  • Off-diagonal = covariances
Example 2×2 Covariance Matrix:
$$\mathbf{C} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$
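
To make this concrete, here is a minimal NumPy sketch (the data and variable names are illustrative) that builds the covariance matrix of a small 2-D dataset and checks the properties listed above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 samples, 2 variables
X_centered = X - X.mean(axis=0)          # mean-center each column

# Covariance matrix: C = (1/(n-1)) X^T X for centered X
C = X_centered.T @ X_centered / (X_centered.shape[0] - 1)

print(np.allclose(C, C.T))               # symmetric
print(np.diag(C))                        # diagonal = per-variable variances
print(np.allclose(C, np.cov(X, rowvar=False)))  # matches numpy's built-in np.cov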

Eigendecomposition: The Heart of PCA

The principal components are the eigenvectors of the covariance matrix, and their importance is given by the corresponding eigenvalues.

Eigenvalue Equation:
$$\mathbf{C} \mathbf{v}_i = \lambda_i \mathbf{v}_i$$
where $\mathbf{v}_i$ is the $i$-th eigenvector and $\lambda_i$ is the corresponding eigenvalue
Full Eigendecomposition:
$$\mathbf{C} = \mathbf{P} \boldsymbol{\Lambda} \mathbf{P}^T$$
where $\mathbf{P}$ is the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues
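
As a short sketch with a hand-picked 2×2 covariance matrix (the numbers are illustrative), NumPy's eigh routine returns the eigenvalues and eigenvectors, and reassembling $\mathbf{P} \boldsymbol{\Lambda} \mathbf{P}^T$ recovers $\mathbf{C}$.

import numpy as np

C = np.array([[2.0, 0.8],
              [0.8, 1.0]])               # example covariance matrix

# eigh is designed for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, P = np.linalg.eigh(C)
Lambda = np.diag(eigenvalues)

print(np.allclose(C, P @ Lambda @ P.T))                      # C = P Λ P^T
print(np.allclose(C @ P[:, 0], eigenvalues[0] * P[:, 0]))    # C v_i = λ_i v_i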

Understanding Eigenvectors

The eigenvalues shape the data's covariance ellipse: the eigenvectors give the directions of its axes, and a larger eigenvalue means more variance (a longer axis) along the corresponding eigenvector.

PCA Algorithm: Step by Step

  1. Center the data (and usually scale it): subtract the mean of each variable; optionally divide by the standard deviation when variables are on different scales
  2. Compute covariance matrix: $\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$
  3. Find eigenvalues and eigenvectors: Solve $\mathbf{C} \mathbf{v}_i = \lambda_i \mathbf{v}_i$
  4. Sort eigenvalues in descending order with corresponding eigenvectors
  5. Choose number of components: Select $k$ largest eigenvalues/eigenvectors
  6. Transform data: $\mathbf{Y} = \mathbf{X} \mathbf{P}_k$ where $\mathbf{P}_k$ contains first $k$ eigenvectors
# NumPy implementation of the steps above (assumes X is an (n_samples, n_features) array)
import numpy as np

# Step 1: Center the data
X_centered = X - np.mean(X, axis=0)

# Step 2: Compute the covariance matrix
cov_matrix = np.cov(X_centered.T)

# Step 3: Eigendecomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: Sort by eigenvalue in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Steps 5 & 6: Keep k components and transform the data
k = 2  # number of components
principal_components = X_centered @ eigenvectors[:, :k]
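
In practice you would rarely code these steps by hand. A minimal scikit-learn equivalent (assuming X is the same numeric array as above; centering is done internally) looks like this:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)              # keep the first two components
Y = pca.fit_transform(X)               # centered internally, not scaled

print(pca.components_)                 # rows = principal directions (eigenvectors)
print(pca.explained_variance_)         # eigenvalues of the covariance matrix
print(pca.explained_variance_ratio_)   # fraction of total variance per component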

How Many Components to Keep?

This is one of the most important decisions in PCA. You want to retain enough components to capture most of the variance while achieving meaningful dimensionality reduction.

Methods to Choose k:

  • Explained Variance Ratio: Keep components until you reach desired % (e.g., 95%)
  • Scree Plot: Look for "elbow" in eigenvalue plot
  • Kaiser Rule: Keep components with eigenvalue > 1 (meaningful when PCA is run on standardized data, i.e. on the correlation matrix)
  • Cross-validation: Use downstream task performance
Explained Variance Ratio:
$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^p \lambda_j}$$

Cumulative Explained Variance:
$$\text{Cum}_k = \frac{\sum_{i=1}^k \lambda_i}{\sum_{j=1}^p \lambda_j}$$
where $p$ is the total number of components
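
A short sketch of these two quantities, assuming eigenvalues is the descending-sorted array from the step-by-step code above:

import numpy as np

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative_variance = np.cumsum(explained_variance_ratio)

# Smallest k that captures at least 95% of the total variance
k = int(np.searchsorted(cumulative_variance, 0.95) + 1)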

Common Mistakes and Pitfalls

⚠️ Mistake #1: Not Centering Data

Always center your data (subtract the mean of each variable). If you don't, the first PC tends to point from the origin toward the data mean rather than along the direction of maximum variance.

⚠️ Mistake #2: Scaling Issues

When variables have different units/scales, consider standardizing (divide by std dev). Otherwise, variables with larger scales will dominate the PCs.
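
One common pattern (a sketch; whether to scale, and how, depends on your data) is to standardize inside a pipeline so the scaling is learned together with the PCA step:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize each variable to zero mean and unit variance before PCA
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
Y = pipeline.fit_transform(X)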

⚠️ Mistake #3: Interpreting Components Causally

PCs are mathematical constructs, not necessarily meaningful in your domain. Don't assume PC1 represents a "real" underlying factor.

⚠️ Mistake #4: Using PCA with Categorical Variables

PCA assumes linear relationships and continuous variables. For categorical data, consider correspondence analysis or other methods.

⚠️ Mistake #5: Ignoring Outliers

PCA is sensitive to outliers since it maximizes variance. Consider robust PCA methods or outlier removal.

Practical Considerations

When to Use PCA:

  • High-dimensional data (curse of dimensionality)
  • Multicollinear variables
  • Need for data visualization
  • Noise reduction required
  • Storage/computation constraints

When NOT to Use PCA:

  • Already low-dimensional data
  • Need interpretable features
  • Sparse data (many zeros)
  • Categorical variables
  • Non-linear relationships

Alternatives to Consider:

t-SNE/UMAP: For non-linear dimensionality reduction and visualization
Factor Analysis: When you want interpretable latent factors
Independent Component Analysis (ICA): When you need statistically independent components
Sparse PCA: When you want sparse, interpretable loadings

Reconstruction Error:
$$\text{Error} = \|\mathbf{X} - \mathbf{X}_{\text{reconstructed}}\|^2$$
$$\mathbf{X}_{\text{reconstructed}} = (\mathbf{X} \mathbf{P}_k) \mathbf{P}_k^T$$
where $\mathbf{X}$ is the mean-centered data; a lower reconstruction error means the $k$ components give a better approximation of the original data.
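
As a sketch, reusing X_centered, eigenvectors, and k from the step-by-step code above:

import numpy as np

P_k = eigenvectors[:, :k]                      # first k principal directions
X_reconstructed = (X_centered @ P_k) @ P_k.T   # project down, then back up

# Sum of squared residuals; equals (n - 1) times the sum of the discarded eigenvalues
reconstruction_error = np.sum((X_centered - X_reconstructed) ** 2)
print(reconstruction_error)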

Advanced Topics and Extensions

Kernel PCA

Extends PCA to non-linear relationships by mapping data to higher-dimensional space using kernel trick.

$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$
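
A minimal scikit-learn sketch (the kernel choice and gamma value are illustrative, and X is assumed to be a numeric array):

from sklearn.decomposition import KernelPCA

# The RBF kernel maps the data implicitly into a higher-dimensional feature space
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
Y = kpca.fit_transform(X)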

Sparse PCA

Adds sparsity constraints to get interpretable components with fewer non-zero loadings.

$$\min \|\mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{H}^T\|^2 + \lambda\|\mathbf{W}\|_1$$
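
With scikit-learn (a sketch; alpha controls the strength of the sparsity penalty, and X is assumed to be a numeric array):

import numpy as np
from sklearn.decomposition import SparsePCA

spca = SparsePCA(n_components=2, alpha=1.0)    # larger alpha -> sparser loadings
Y = spca.fit_transform(X)
print(np.sum(spca.components_ == 0))           # number of exactly-zero loadings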

Robust PCA

Decomposes data into low-rank + sparse components, handling outliers and missing data better.

$$\mathbf{X} = \mathbf{L} + \mathbf{S} + \mathbf{N}$$
Low-rank + Sparse + Noise

Incremental PCA

Processes data in batches, useful for large datasets that don't fit in memory.

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
# Fit on mini-batches so the full dataset never has to be in memory
for batch in data_batches:
    ipca.partial_fit(batch)

Summary and Best Practices

🎯 Key Takeaways

  • PCA finds directions of maximum variance in your data
  • It's based on eigendecomposition of the covariance matrix
  • Always center your data, consider scaling
  • Choose components based on explained variance and downstream tasks
  • Be careful about interpretation - PCs are mathematical, not necessarily meaningful

✅ Best Practices:

  • Always visualize your data first
  • Check for outliers and handle appropriately
  • Validate dimensionality reduction with downstream tasks
  • Document your preprocessing steps
  • Consider domain knowledge in interpretation
  • Test different numbers of components

📊 Implementation Checklist:

  • ✓ Data exploration and cleaning
  • ✓ Decide on centering/scaling
  • ✓ Compute PCA transformation
  • ✓ Analyze explained variance
  • ✓ Choose optimal number of components
  • ✓ Validate results
  • ✓ Document assumptions and limitations
Remember the fundamental equation:
$$\mathbf{Y} = \mathbf{X} \mathbf{P}$$
Transformed Data = (Mean-Centered) Data × Matrix of Principal Components

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin