Many students learn algorithms before they learn how to evaluate them safely. This page closes that gap:
what each data split does, how to choose the right split strategy, and how leakage quietly creates over-optimistic results in health research.
Beginner
Learn what the three splits mean and why a single accuracy number is not enough.
Intermediate
Compare random, stratified, grouped, and time-aware splitting for real study designs.
Advanced
Spot leakage from preprocessing, repeated patients, feature selection, and temporal drift.
3 core split roles to keep separate
4 common split strategies in practice
1 main rule: never let future information leak backward
Why this page matters
This portal already covers metrics, cross-validation, nested cross-validation, and hyperparameter tuning.
The missing bridge is the basic evaluation workflow that must happen before all of those topics make sense.
Without clean splitting, even a sophisticated model evaluation pipeline can still be wrong.
Train set: fit model parameters
Validation set: tune choices
Test set: final unbiased check
Simple rule: train learns, validation decides, test judges.
What each split is for
Beginner
Training set
Used to fit model parameters such as regression coefficients, tree splits, or neural network weights.
Usually the largest split, because this is where the model learns.
Can include internal resampling such as cross-validation.
Preprocessing should be fit here first.
Intermediate
Validation set
Used to compare models, pick hyperparameters, and decide when to stop tuning.
Answers "which version should I keep?"
Can be replaced by cross-validation when data are limited.
Should not be used for the final headline result.
Advanced
Test set
Used once at the end to estimate performance on truly unseen data.
Touch it only after the pipeline is fixed.
Do not tune thresholds, or any other choice, on it.
Think of it as a small external audit.
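The three roles above can be sketched with two successive calls to scikit-learn's train_test_split. This is a minimal illustration on synthetic data, not a recommended recipe: the 60/20/20 proportions, the random seeds, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 5 features, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Step 1: set aside 20% as the test set and leave it untouched until the end.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

This plain random version assumes independent rows. It is only a baseline for the strategies discussed next.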
Choose the right splitting strategy
The best split is determined by the study design, not by habit. The most common options are compared below.
Random split
Useful when observations are independent and class balance is not a major issue.
Fast and easy baseline.
Good for many textbook datasets.
Risky if the same patient appears more than once.
Stratified split
Preserves class proportions across splits, which matters when outcomes are imbalanced.
Useful for rare disease classification.
Keeps train, validation, and test more comparable.
Does not solve patient-level leakage by itself.
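As a small illustration of what "preserves class proportions" means, the sketch below passes the outcome to the stratify argument of train_test_split. The 5% prevalence and the placeholder feature are assumptions chosen so that the effect is easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced outcome (assumption): 50 positives among 1000 rows.
y = np.array([1] * 50 + [0] * 950)
X = np.arange(1000).reshape(-1, 1)  # placeholder feature, not meaningful

# stratify=y asks for the same class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Prevalence in each part; both should match the overall 5%.
print(y_train.mean(), y_test.mean())
```

Without stratify=y, a small test set could by chance contain very few positives, which makes metrics for the rare class unstable.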
Grouped split
Keeps all records from the same patient, family, site, or hospital inside one split only.
Essential when repeated measures exist.
Often the right choice for health datasets.
Usually gives a more realistic, and often lower, performance estimate.
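One way to keep each patient inside a single split is scikit-learn's GroupShuffleSplit. The sketch below uses synthetic repeated-visit data; the patient_id array and the 25% test fraction are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic repeated-measures data (assumption): 40 patients, 5 visits each.
patient_id = np.repeat(np.arange(40), 5)
rng = np.random.default_rng(0)
X = rng.normal(size=(len(patient_id), 3))
y = rng.integers(0, 2, size=len(patient_id))

# test_size is a fraction of patients (groups), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

train_patients = set(patient_id[train_idx])
test_patients = set(patient_id[test_idx])
shared = train_patients & test_patients

# Counts of train patients, test patients, and patients present in both.
print(len(train_patients), len(test_patients), len(shared))  # 30 10 0
```

The same idea applies to families, sites, or hospitals: pass whichever identifier defines the unit that must not be split as groups.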
Time-aware split
Trains on earlier data and tests on later data, which mimics deployment more honestly.
Important for prognosis and hospital operations.
Protects against using future information.
Can reveal concept drift and changing prevalence.
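A time-aware split can be as simple as cutting at a date instead of shuffling. The sketch below uses only the Python standard library; the record fields (admit_date, value) and the 80-day cutoff are hypothetical and chosen only to show the pattern.

```python
from datetime import date, timedelta

# Synthetic admissions (assumption): one record per day for 100 days.
start = date(2023, 1, 1)
records = [
    {"admit_date": start + timedelta(days=i), "value": i % 7}
    for i in range(100)
]

# Cut at a calendar date: everything before it trains, everything after tests.
cutoff = start + timedelta(days=80)
train = [r for r in records if r["admit_date"] < cutoff]
test = [r for r in records if r["admit_date"] >= cutoff]

# Sanity check: no training record is later than any test record.
latest_train = max(r["admit_date"] for r in train)
earliest_test = min(r["admit_date"] for r in test)
print(len(train), len(test), latest_train < earliest_test)  # 80 20 True
```

In a real study the cutoff would follow from the design, for example the date after which the model would have been deployed.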
Healthcare tip: if a patient can appear multiple times, grouped splitting is usually safer than plain random splitting.
Data leakage: the hidden source of fake performance
Leakage happens when information from the validation or test data, or information that would not be available at prediction time, influences model training or design choices.
The model then looks better in development than it will in the real world.
Common trap
Scaling before splitting
If you standardize the whole dataset first, the test set has already influenced the mean and standard deviation.
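The safer pattern for this trap is to learn the scaling statistics from the training rows and reuse them everywhere else. A minimal sketch with scikit-learn's StandardScaler, on synthetic data with assumed sizes and seeds:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption): 500 rows, 4 columns centered near 50.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 4))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky pattern to avoid: StandardScaler().fit(X) would also use the test rows.
scaler = StandardScaler().fit(X_train)     # mean and SD from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics

# The stored means equal the training means, not the full-data means.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

The scaled test columns will not have a mean of exactly zero. That is expected, because the test rows played no part in estimating the statistics.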
Common trap
Feature selection on all data
Selecting top genes or biomarkers before splitting can leak signal from the test set into the model.
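The effect can be demonstrated on pure noise, where no real signal exists. The sketch below is an illustration under assumed settings (100 samples, 2000 random "genes", top 20 features, logistic regression, 5-fold cross-validation), not a benchmark. Selecting features on all rows first tends to give an accuracy well above chance, while putting the selection inside a scikit-learn pipeline gives a result near chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise (assumption): labels are unrelated to every feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = np.array([0, 1] * 50)

# Leaky: choose the 20 "best" features using all rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Safer: selection sits inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky: {leaky_acc:.2f}  honest: {honest_acc:.2f}")
```

Because the data contain no signal, any accuracy clearly above 0.5 in the leaky version comes from the selection step having seen the held-out rows.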
Common trap
Patient overlap or future labels
Repeated visits, future lab values, or post-outcome variables can make the task unrealistically easy.
Leakage warning: any step that learns from data must be fit inside the training data only, then applied to validation or test data.
Mini workflow for safe model development
1. Define the prediction moment
Ask what information would truly be available at decision time.
2. Split first
Separate train, validation, and test before scaling, imputation, or feature selection.
3. Build the pipeline on training data
Fit preprocessing and the model only on the training split or within CV folds.
4. Tune with validation logic
Use a validation set or cross-validation to compare models fairly.
5. Report the test result once
Use the test set for the final performance estimate after all decisions are locked.
6. Explain limitations
Note class imbalance, site shift, drift, and whether grouped or temporal splitting was used.
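Steps 2 to 5 of this workflow can be sketched end to end with scikit-learn. Everything specific here is an assumption for illustration: the synthetic dataset from make_classification, the logistic regression model, the grid of C values, and ROC AUC as the metric. Steps 1 and 6 are judgment calls about the study and are not represented in code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption): roughly 20% positives.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Step 2: split first, stratified because the outcome is imbalanced.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 3: preprocessing and model live in one pipeline,
# so the scaler is refit inside every cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 4: tune with cross-validation on the development data only.
search = GridSearchCV(
    pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, scoring="roc_auc", cv=5
)
search.fit(X_dev, y_dev)

# Step 5: evaluate the locked pipeline on the test set once.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

Here cross-validation on the development data plays the role of the validation set. With repeated patients or time-ordered data, the random split and the default folds would need to be replaced by grouped or time-aware versions.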
Quick self-check
Use these short cases to test whether you can recognize good splitting practice.
Case 1
You normalize all variables before creating train and test sets. Is this safe?
Correct answer: No. The scaling parameters should be learned from the training data only, then applied to the other splits.
Case 2
You have five visits per patient. Should random row-wise splitting be your default?
Correct answer: No. Patient-level grouping is safer because row-wise splitting can put the same patient into train and test.
Case 3
You tune hyperparameters until the test score improves. Is the test set still unbiased?
Correct answer: No. Repeated tuning on the test set turns it into another validation set and inflates the reported performance.
How this connects to other portal pages
Before ROC and metrics
Metrics mean little if the test set is contaminated or the split is unrealistic.
Before k-fold and nested CV
Cross-validation is a smarter validation design, not a replacement for safe split logic.
Before hyperparameter tuning
Tuning only helps when the evaluation loop is clean and leakage-free.