Cross-Validation & Training Concepts

Essential techniques for robust machine learning model evaluation

🎯 Training, Validation & Test Sets

Understanding how to properly split your data is crucial for building reliable machine learning models that generalize well to unseen data.

Training Set (60-70%)


Purpose: Train the model parameters

Usage: Model learns patterns from this data

Size: Largest portion of your dataset

Validation Set (15-20%)


Purpose: Hyperparameter tuning & model selection

Usage: Evaluate different model configurations

Size: Medium portion for reliable estimates

Test Set (15-20%)


Purpose: Final unbiased performance evaluation

Usage: Used only once at the end

Size: Sufficient for reliable performance estimate

Key Principles:

  • No Data Leakage: Test set should never be used during training or validation
  • Representative Splits: Each set should represent the overall data distribution
  • Stratification: Maintain class balance across splits for classification problems (see the sketch after this list)
  • Time Awareness: For time series data, respect temporal order in splits
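
As a concrete illustration of these principles, here is a minimal sketch of a stratified 60/20/20 split using scikit-learn's train_test_split; the dataset is synthetic, and the sizes, seeds, and class weights are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced classification dataset (~80/20 class ratio).
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# First split: hold out 20% of the data as the untouched test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Second split: 25% of the remaining 80% = 20% of the total for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```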

🔄 Cross-Validation Methods

K-Fold Cross-Validation

Divides the data into k equal folds; each fold serves as the validation set exactly once while the remaining k-1 folds are used for training (see the sketch below).

Pros

  • Uses all data for both training and validation
  • Reduces variance in performance estimates
  • Standard and widely accepted

Cons

  • Computationally expensive (k times more training)
  • May not preserve class distribution
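
A minimal 5-fold CV sketch, assuming scikit-learn, a synthetic dataset, and logistic regression as a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 5 folds serves as the validation set exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```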

Stratified K-Fold

Similar to K-Fold but maintains the class distribution in each fold, crucial for imbalanced datasets.

Pros

  • Preserves class distribution
  • Better for imbalanced datasets
  • More reliable performance estimates

Cons

  • Only applicable to classification
  • Still computationally expensive
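
A sketch of how StratifiedKFold preserves the minority-class fraction in every validation fold, assuming a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold mirrors the overall ~10% minority-class rate.
    print(f"fold {i}: minority fraction = {y[val_idx].mean():.2f}")
```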

Leave-One-Out (LOO)

The extreme case where k equals the number of samples, n: each individual sample serves as the validation set once, requiring n model fits (see the sketch below).

Pros

  • Uses maximum data for training
  • Deterministic results
  • Good for small datasets

Cons

  • Extremely computationally expensive
  • High variance in estimates
  • Impractical for large datasets
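
A minimal LOO sketch on the small iris dataset (150 samples, hence 150 model fits); the model choice is an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One model fit per sample; each fold's "score" is a single 0/1 outcome.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(f"{loo.get_n_splits(X)} fits, mean accuracy = {scores.mean():.3f}")
```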

Time Series Split

Respects temporal order by using past data for training and future data for validation.

Pros

  • Respects temporal dependencies
  • Mimics real-world deployment
  • Prevents data leakage from future

Cons

  • Only for time series data
  • Less data for early folds
  • May be affected by trends
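
A sketch of scikit-learn's TimeSeriesSplit on 12 time-ordered observations, showing the growing training windows and strictly future validation windows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Training windows grow; validation always lies strictly in the future.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, val_idx in tscv.split(X):
    print(f"train: {train_idx.tolist()}  val: {val_idx.tolist()}")
```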

⚡ Training Process Workflow

1. Data Collection: Gather and clean your dataset
2. Data Splitting: Divide into train/val/test sets
3. Cross-Validation: Apply a CV strategy on train+val
4. Model Training: Train models on each fold
5. Hyperparameter Tuning: Optimize based on CV results
6. Final Evaluation: Test the best model on the test set
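
Put together, steps 2-6 might look like the following scikit-learn sketch; the random forest, parameter grid, and split sizes are illustrative assumptions rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Steps 1-2: collect data and carve out the final test set.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Steps 3-5: cross-validated hyperparameter search on train+val.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
    cv=5,  # stratified 5-fold CV is scikit-learn's default for classifiers
)
grid.fit(X_trainval, y_trainval)

# Step 6: evaluate the refitted best model exactly once on the test set.
print(f"best CV score: {grid.best_score_:.3f}")
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```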

📊 Bias-Variance Trade-off in Cross-Validation

Understanding the Trade-off:

  • More Folds (Higher k): Lower bias (each training fold is nearly the full dataset), higher variance in performance estimates
  • Fewer Folds (Lower k): Higher bias (each model trains on substantially less data), lower variance in performance estimates
  • Sweet Spot: 5-10 folds typically provide a good balance for most problems
  • Computational Cost: More folds mean more model training iterations
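
A rough empirical probe of this trade-off: repeat k-fold CV over many random shuffles and compare the spread of the mean scores. The dataset, model, and number of repetitions are arbitrary assumptions, and the direction of the variance effect can vary with the data, so treat the output as illustrative rather than as proof:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# The spread of the repeated means gives a feel for estimator variability;
# the level of the means reflects the bias side (smaller training folds
# for small k typically depress the score).
for k in (2, 5, 10):
    means = [
        cross_val_score(
            LogisticRegression(max_iter=1000), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=seed),
        ).mean()
        for seed in range(20)
    ]
    print(f"k={k:2d}: mean={np.mean(means):.3f}, spread (std)={np.std(means):.3f}")
```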

💡 Best Practices & Key Takeaways

  • Choose Appropriate CV: Consider your data type (time series, imbalanced, small datasets)
  • Consistent Preprocessing: Fit preprocessing (scaling, imputation, feature selection) on the training portion of each fold only, then apply it to the validation portion
  • Avoid Data Leakage: Never let test-set information, or future observations in time series, influence training
  • Report Statistics: Always report mean and standard deviation of CV scores
  • Stratify When Needed: Use stratified sampling for classification problems
  • Nested CV: Use nested cross-validation for unbiased model comparison
  • Save Your Models: Keep trained models from each fold for ensembling
  • Monitor Overfitting: Large gap between training and validation scores indicates overfitting
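
Two of these practices, leak-free preprocessing and nested cross-validation, combine naturally in one sketch; the scaler, SVM, and parameter grid are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

# The pipeline refits the scaler inside every training fold, so no
# validation statistics leak into preprocessing.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Inner loop tunes C; the outer loop estimates the performance of the
# whole tuning procedure, giving a less biased model comparison.
inner = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```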

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin
