
Train, Validation, Test Splits and Data Leakage

Many students learn algorithms before they learn how to evaluate them safely. This page closes that gap: what each data split does, how to choose the right split strategy, and how leakage quietly creates over-optimistic results in health research.

Beginner: Learn what the three splits mean and why a single accuracy number is not enough.
Intermediate: Compare random, stratified, grouped, and time-aware splitting for real study designs.
Advanced: Spot leakage from preprocessing, repeated patients, feature selection, and temporal drift.

  • 3 core split roles to keep separate
  • 4 common split strategies in practice
  • 1 main rule: never let future information leak backward

Why this page matters

This portal already covers metrics, cross-validation, nested cross-validation, and hyperparameter tuning. The missing bridge is the basic evaluation workflow that must happen before all of those topics make sense. Without clean splitting, even a sophisticated model evaluation pipeline can still be wrong.

  • Train set: fit model parameters.
  • Validation set: tune choices.
  • Test set: final unbiased check.

Simple rule: train learns, validation decides, test judges.

What each split is for

Beginner

Training set

Used to fit model parameters such as regression coefficients, tree splits, or neural network weights.

  • Usually the largest split, because this is where the model learns.
  • Can include internal resampling such as cross-validation.
  • Preprocessing should be fit here first.
Intermediate

Validation set

Used to compare models, pick hyperparameters, and decide when to stop tuning.

  • Answers "which version should I keep?"
  • Can be replaced by cross-validation when data are limited.
  • Should not be used for the final headline result.
Advanced

Test set

Used once at the end to estimate performance on truly unseen data.

  • Touch it only after the pipeline is fixed.
  • Do not tune thresholds repeatedly on it.
  • Think of it as a final audit of the finished pipeline, not as a tuning tool.
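
The three roles above can be sketched with the standard library alone. This is a minimal illustration, assuming independent observations and a 60/20/20 split; the function name `three_way_split` and the fractions are illustrative choices, not a recommendation from this page.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the rows into train / validation / test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything left over is used for fitting
    return train, val, test

records = list(range(100))  # stand-in for 100 independent observations
train, val, test = three_way_split(records)
print(len(train), len(val), len(test))  # 60 20 20
```

Each record lands in exactly one split, so nothing the model is fit on can reappear in the final check. This row-wise shuffle is only appropriate when rows really are independent.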

Choose the right splitting strategy

The best split is determined by the study design, not by habit. The most common options are random, stratified, grouped, and time-aware splitting.

Random split

Useful when observations are independent and class imbalance is not a major issue.

  • Fast and easy baseline.
  • Good for many textbook datasets.
  • Risky if the same patient appears more than once.
Healthcare tip: if a patient can appear multiple times, grouped splitting is usually safer than plain random splitting.
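
A grouped split can be sketched as follows: whole patients, not rows, are assigned to a split. This is a minimal standard-library illustration; the field names `patient_id` and `visit` and the helper `split_by_patient` are hypothetical.

```python
import random

def split_by_patient(rows, test_frac=0.25, seed=0):
    """Assign whole patients, not individual rows, to train or test."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_frac))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in rows if r["patient_id"] not in test_ids]
    test = [r for r in rows if r["patient_id"] in test_ids]
    return train, test

# Toy data: 8 patients with 5 visits each.
visits = [{"patient_id": p, "visit": v} for p in range(8) for v in range(5)]
train, test = split_by_patient(visits)

train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}
print(len(train), len(test), train_ids & test_ids)  # 30 10 set()
```

The empty intersection is the property that matters: no patient contributes rows to both sides. scikit-learn offers `GroupShuffleSplit` and `GroupKFold` for the same purpose when a library solution is preferred.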

Data leakage: the hidden source of fake performance

Leakage happens when information from the validation or test data, or information that would not be available at prediction time, influences model training or design choices. The model then looks better in development than it will in the real world.

Common trap

Scaling before splitting

If you standardize the whole dataset first, the test set has already influenced the mean and standard deviation.
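
A small sketch of the difference, using only the standard library. The lab values are made up for illustration, and `fit_scaler` and `apply_scaler` are hypothetical helper names.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(values, mean, std):
    """Standardize values with parameters learned elsewhere."""
    return [(v - mean) / std for v in values]

train_values = [4.0, 5.0, 6.0, 5.0]  # hypothetical lab values
test_values = [9.0, 11.0]            # shifted on purpose

# Safe: parameters come from the training split only.
mean, std = fit_scaler(train_values)
train_scaled = apply_scaler(train_values, mean, std)
test_scaled = apply_scaler(test_values, mean, std)

# Leaky: the test values pull the mean upward before any model is fit.
leaky_mean, leaky_std = fit_scaler(train_values + test_values)
print(mean, round(leaky_mean, 3))  # 5.0 6.667
```

In the leaky version the training data are centered around a value that only exists because the test set was visible.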

Common trap

Feature selection on all data

Selecting top genes or biomarkers on the full dataset, using the outcome labels of every sample, leaks signal from the test set into the model.
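
The effect can be demonstrated on pure noise. The sketch below assumes scikit-learn and NumPy are installed; the sample size, feature count, and `k=20` are arbitrary illustrative choices. Because the features carry no signal, an honest estimate should sit near chance, while selecting features on all data before cross-validation will typically look much better than that.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # pure-noise "biomarkers"
y = np.array([0, 1] * 30)        # labels unrelated to X

# Leaky: features are chosen using the labels of every sample, then cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Clean: the pipeline refits the selection inside each training fold.
clean_model = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
clean = cross_val_score(clean_model, X, y, cv=5).mean()

print(f"leaky CV accuracy: {leaky:.2f}, clean CV accuracy: {clean:.2f}")
```

The only difference between the two estimates is where the selection step is fit, which is exactly the point of the trap.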

Common trap

Patient overlap or future labels

Repeated visits, future lab values, or post-outcome variables can make the task unrealistically easy.
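
For data collected over time, a time-aware split keeps the evaluation honest by training on earlier encounters and testing on later ones. A minimal sketch, assuming each row carries an `encounter_date` field (a hypothetical name) and that a single calendar cutoff is acceptable for the study:

```python
from datetime import date

def temporal_split(rows, cutoff):
    """Train on encounters before the cutoff, test on those on or after it."""
    train = [r for r in rows if r["encounter_date"] < cutoff]
    test = [r for r in rows if r["encounter_date"] >= cutoff]
    return train, test

encounters = [
    {"patient_id": 1, "encounter_date": date(2021, 3, 1)},
    {"patient_id": 2, "encounter_date": date(2021, 9, 15)},
    {"patient_id": 3, "encounter_date": date(2022, 2, 10)},
    {"patient_id": 4, "encounter_date": date(2022, 8, 5)},
]
train, test = temporal_split(encounters, cutoff=date(2022, 1, 1))
print(len(train), len(test))  # 2 2
```

If the same patient can appear on both sides of the cutoff, this needs to be combined with patient-level grouping.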

Leakage warning: any step that learns from data must be fit inside the training data only, then applied to validation or test data.

Mini workflow for safe model development

1. Define the prediction moment

Ask what information would truly be available at decision time.
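
One way to make the prediction moment concrete is to filter candidate features by when they were recorded. The sketch below is illustrative only: the measurement names, timestamps, and the helper `features_available_at` are hypothetical.

```python
from datetime import datetime

def features_available_at(measurements, prediction_time):
    """Keep only measurements recorded at or before the prediction moment."""
    return [m for m in measurements if m["recorded_at"] <= prediction_time]

prediction_time = datetime(2023, 5, 1, 8, 0)  # hypothetical decision time
measurements = [
    {"name": "creatinine", "recorded_at": datetime(2023, 4, 30, 22, 0)},
    {"name": "lactate", "recorded_at": datetime(2023, 5, 1, 7, 30)},
    # Recorded days after the decision, so it must not be used as a feature.
    {"name": "discharge_code", "recorded_at": datetime(2023, 5, 6, 12, 0)},
]

usable = features_available_at(measurements, prediction_time)
print([m["name"] for m in usable])  # ['creatinine', 'lactate']
```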

2. Split first

Separate train, validation, and test before scaling, imputation, or feature selection.

3. Build the pipeline on training data

Fit preprocessing and the model only on the training split or within CV folds.

4. Tune with validation logic

Use a validation set or cross-validation to compare models fairly.

5. Report the test result once

Use the test set for the final performance estimate after all decisions are locked.
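
Splitting, pipeline building, tuning, and the single test evaluation can be sketched together with scikit-learn. This assumes scikit-learn is installed and uses a synthetic dataset with independent rows; the model, the grid of `C` values, and the 80/20 split are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Split first: hold out the test set before any preprocessing (stratified by class).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaler and model live in one pipeline, so every CV fold refits both.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Cross-validation on the development data plays the validation role.
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_dev, y_dev)

# One evaluation on the untouched test set, after all choices are fixed.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The test data never enter `search.fit`, so neither the scaling parameters nor the chosen `C` have seen them.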

6. Explain limitations

Note class imbalance, site shift, drift, and whether grouped or temporal splitting was used.

Quick self-check

Use these short cases to test whether you can recognize good splitting practice.

Case 1

You normalize all variables before creating train and test sets. Is this safe?

Correct answer: No. The scaling parameters should be learned from the training data only, then applied to the other splits.

Case 2

You have five visits per patient. Should random row-wise splitting be your default?

Correct answer: No. Patient-level grouping is safer because row-wise splitting can put the same patient into train and test.

Case 3

You tune hyperparameters until the test score improves. Is the test set still unbiased?

Correct answer: No. Repeated tuning on the test set turns it into another validation set, so the reported score becomes optimistically biased.

How this connects to other portal pages

Before ROC and metrics

Metrics mean little if the test set is contaminated or the split is unrealistic.

Before k-fold and nested CV

Cross-validation is a smarter validation design, not a replacement for safe split logic.

Before hyperparameter tuning

Tuning only helps when the evaluation loop is clean and leakage-free.