K-Fold Cross-Validation — Interactive Learning Tool

Understand how K-fold cross-validation works by visualizing the data splits and training process

🎯 What is K-Fold Cross-Validation?

K-fold cross-validation is a technique to assess how well a machine learning model will generalize to new data. It divides the training data into K equal parts (folds), then trains and validates the model K times, each time using a different fold as validation data.
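The idea above can be sketched in a few lines. This is a minimal illustration assuming scikit-learn is available; the dataset and model are placeholders, not part of the course material.

```python
# Minimal K-fold CV sketch (assumes scikit-learn; data/model are illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5: train and validate 5 times, each time holding out a different fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average validation accuracy across the 5 folds
```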

[Interactive visualization: the full dataset is split into training data and held-out test data; the current-iteration view highlights the folds used for training this iteration, the validation fold, and the test set.]

🎯 Final Step: Test Set Evaluation

  • Train the final model on ALL training data
  • Evaluate it on the test set

This happens only once! After K-fold CV helps you select the best model configuration, you train the final model on the complete training dataset and get your unbiased performance estimate from the test set.
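The final step described above can be sketched as follows. This assumes scikit-learn; the dataset, split ratio, and model are illustrative placeholders.

```python
# Final-step sketch (assumes scikit-learn): after CV-based model selection,
# refit on ALL training data, then score the untouched test set ONCE.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train, y_train)                # complete training data
test_score = final_model.score(X_test, y_test)   # one-time unbiased estimate
print(test_score)
```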

[Fold diagram legend: header row shows the fold index; cells are marked as training, validation, or held-out test.]

📚 How K-Fold Cross-Validation Works

Step 1: The data is split into training and test sets. The test set is completely held out and never used during cross-validation.

💡 Key Concept: K-fold CV only operates on the training data. The test set remains untouched for final evaluation.
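Step 1 can be sketched with a single split call. This assumes scikit-learn's `train_test_split`; the data and 80/20 ratio are illustrative.

```python
# Step 1 sketch: carve off a held-out test set BEFORE any cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # K-fold CV will only ever see X_train
```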

Step 2: The training data is divided into K equal folds. In each iteration, one fold serves as validation while the remaining K-1 folds are used for training.
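Step 2 can be made concrete by printing the index splits. A toy sketch assuming scikit-learn's `KFold`; 10 samples with K=5 gives validation folds of size 2.

```python
# Step 2 sketch: KFold yields K (train, validation) index splits
# over the training data only.
import numpy as np
from sklearn.model_selection import KFold

X_train = np.arange(10).reshape(10, 1)  # toy training data
kf = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kf.split(X_train), start=1):
    # One fold (2 samples) validates; the remaining K-1 folds train.
    print(f"iteration {i}: train={train_idx}, validation={val_idx}")
```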

Step 3: This process repeats K times, with each fold taking a turn as the validation set. The final performance is the average of all K validation scores.
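Step 3, the full repeat-and-average loop, can be written out explicitly. A sketch assuming scikit-learn; the model and data are illustrative stand-ins.

```python
# Step 3 sketch: repeat the fit/validate cycle K times and average the scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on K-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on held-out fold

cv_score = np.mean(scores)  # final CV estimate: the mean of the K scores
print(cv_score)
```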

🚨 When is the Test Set Used?

ONLY ONCE at the very end! After completing all K-fold cross-validation iterations and selecting the best model configuration, you train the final model on the entire training data and then evaluate it on the held-out test set. This gives you an unbiased estimate of how well your model will perform on completely unseen data.

⚠️ Important: Never use the test set during model selection or hyperparameter tuning. It's your final, unbiased evaluation set!
[Interactive panel: data distribution for the current iteration (1 of 5), plus a validation-performance table listing each iteration's validation score and their mean.]

🎯 Benefits of K-Fold CV

  • Robust evaluation: uses all training data for both fitting and validation
  • Reduced variance: the average of K estimates is more stable than a single split
  • Better generalization: a more reliable estimate of out-of-sample performance
  • No data waste: every training point contributes to both fitting and validation

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin

Interactive slides designed for enhanced learning experience