Students often jump from raw spreadsheets straight into model training. In real health datasets, that jump is
risky: missing values and information leakage can bias results without raising any visible error. This page explains how to handle missing values, choose sensible preprocessing steps, and build
pipelines that do not leak information across train, validation, and test data.
Beginner: Learn what missing data are, why simple cleaning can bias results, and why preprocessing order matters.
Intermediate: Compare imputation, encoding, scaling, and indicator features inside a proper train-only pipeline.
Advanced: Think about MNAR data, repeated patients, temporal drift, and how preprocessing must sit inside cross-validation.
5 core pipeline stages from raw data to model input
3 main missingness ideas: MCAR, MAR, MNAR
1 golden rule: fit preprocessing on training data only
Why this page matters
The portal already covers algorithms, evaluation, cross-validation, and hyperparameter tuning. The missing bridge is
the practical preparation step that determines whether your model is learning from real signal or from careless data handling.
In healthcare, missingness itself can carry meaning, so preprocessing is not just housekeeping.
Inspect data → Split safely → Fit preprocessing → Train and tune → Test once
Simple rule: split first, preprocess second, evaluate last.
Understand missing data before you “fix” it
Beginner
MCAR
Missing completely at random means the missingness is unrelated to other values or outcomes.
Rare in real clinical datasets.
Simple methods are less risky here.
Still worth checking how much data are missing.
Intermediate
MAR
Missing at random means missingness depends on observed information such as age, ward, or site.
Common in hospital datasets.
Imputation can use other observed predictors.
Missingness indicators may help.
Advanced
MNAR
Missing not at random means the missingness depends on the unobserved value itself or a hidden process.
Example: a lab test is ordered only when disease is suspected.
Simple imputation may distort relationships.
Sensitivity analysis becomes important.
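To make the three mechanisms concrete, here is a small simulation sketch using only the Python standard library. The cohort, the creatinine values, and the missingness probabilities are all invented for illustration; the point is only to show how the mean of the recorded values behaves under each mechanism.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical cohort: older patients tend to have higher creatinine (mg/dL).
patients = []
for _ in range(20000):
    is_older = random.random() < 0.5
    creatinine = random.gauss(1.4 if is_older else 1.0, 0.2)
    patients.append((is_older, creatinine))


def observed_mean(is_missing):
    """Mean creatinine among patients whose value was actually recorded."""
    return statistics.fmean(c for older, c in patients if not is_missing(older, c))


true_mean = statistics.fmean(c for _, c in patients)

# MCAR: every value has the same 30% chance of being missing.
mcar_mean = observed_mean(lambda older, c: random.random() < 0.3)

# MAR: missingness depends only on an observed variable (age group).
mar_mean = observed_mean(lambda older, c: random.random() < (0.6 if older else 0.1))

# MNAR: missingness depends on the unobserved value itself. Low values are
# rarely recorded, as when a test is ordered only if disease is suspected.
mnar_mean = observed_mean(lambda older, c: c < 1.0 and random.random() < 0.8)

print(f"true={true_mean:.3f}  MCAR={mcar_mean:.3f}  MAR={mar_mean:.3f}  MNAR={mnar_mean:.3f}")
```

In this sketch the MCAR mean stays close to the true mean, while the MAR and MNAR means drift away from it. The MAR bias could be reduced by using age, which is observed; the MNAR bias cannot be repaired from the recorded data alone, which is why sensitivity analysis matters there.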
Choose preprocessing by data situation
Compare the common healthcare data situations below. The best choice depends on the variable type,
the study design, and what information would be available at prediction time.
Numeric variables with missing values
Think of blood pressure, creatinine, BMI, or other continuous measures.
Median imputation is a robust baseline.
Scaling can help models like logistic regression, SVM, and neural networks.
Add a missingness flag when absence may be informative.
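One possible way to combine these three ideas is sketched below with scikit-learn (an assumption; the page does not prescribe a library). The two columns and their values are made up. `SimpleImputer(add_indicator=True)` appends one 0/1 flag for each column that had missing values when the imputer was fitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: systolic blood pressure (mmHg), creatinine (mg/dL).
X_train = np.array([
    [120.0, 0.9],
    [135.0, np.nan],
    [np.nan, 1.4],
    [150.0, 1.1],
    [110.0, np.nan],
])
X_new = np.array([[np.nan, 2.0]])  # a later patient with no blood pressure recorded

numeric_steps = Pipeline([
    # Median imputation plus missingness flags, both learned from X_train.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # Note: in this simple sketch the 0/1 flags are standardised as well.
    ("scale", StandardScaler()),
])

numeric_steps.fit(X_train)                       # learn medians, flags, means, SDs
X_new_prepared = numeric_steps.transform(X_new)  # reuse them; nothing is refitted

print(numeric_steps.named_steps["impute"].statistics_)  # training medians
print(X_new_prepared.shape)  # two imputed columns plus two flag columns
```

The new patient's missing blood pressure is filled with the training median, and the flag column records that the value was absent.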
Categorical variables and codes
Think of smoking status, ward type, hospital site, or ICD groups.
Create an explicit “missing” category if it is meaningful.
One-hot encoding is common and easy to explain.
Rare-category grouping may improve stability.
Repeated measurements from the same patient
Longitudinal and multi-visit datasets need special care.
Split by patient before fitting imputers or scalers.
Imputers fitted on rows pooled across splits can leak person-specific patterns when the same patient appears on both sides.
Grouped cross-validation is often more realistic.
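A patient-level split could be sketched as follows with scikit-learn's `GroupShuffleSplit` and `GroupKFold`. The visit table and patient identifiers are invented; the only requirement is a column that identifies the patient behind each row.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical visit-level data: 12 visits from 6 patients.
patient_id = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 6, 6])
X = np.arange(len(patient_id), dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0])

# Hold out 2 whole patients (an integer test_size counts groups, not rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

train_patients = set(patient_id[train_idx])
test_patients = set(patient_id[test_idx])
overlap = train_patients & test_patients  # empty: no patient on both sides

# Grouped cross-validation inside the training patients only.
groups_train = patient_id[train_idx]
cv = GroupKFold(n_splits=2)
fold_overlaps = []
for fold_train, fold_val in cv.split(X[train_idx], y[train_idx], groups=groups_train):
    shared = set(groups_train[fold_train]) & set(groups_train[fold_val])
    fold_overlaps.append(shared)

print(sorted(test_patients), overlap, fold_overlaps)
```

Any imputer or scaler would then be fitted on `X[train_idx]` (or on the training part of each fold), so no visit from a held-out patient contributes to the fitted statistics.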
Prediction over time
When deployment happens later than development, the preprocessing must respect time order too.
Fit imputers on earlier data only.
Watch for changing assay methods or documentation practices.
Recheck calibration and missingness rates over time.
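A minimal standard-library sketch of the time-ordered version is shown below. The dates, lactate values, and the cutoff are hypothetical; the idea is that the fill value comes from the earlier period only, and that the missingness rate is compared between periods.

```python
from datetime import date
from statistics import median

# Hypothetical records: (measurement date, lactate in mmol/L, or None if not measured).
records = [
    (date(2021, 3, 1), 1.5),
    (date(2021, 6, 15), None),
    (date(2021, 9, 30), 2.5),
    (date(2022, 1, 10), 1.6),
    (date(2022, 5, 5), None),
    (date(2022, 8, 20), None),
    (date(2022, 11, 2), 3.1),
]
cutoff = date(2022, 1, 1)  # assumed boundary between development and later data

earlier = [(d, v) for d, v in records if d < cutoff]
later = [(d, v) for d, v in records if d >= cutoff]

# Fit the imputation value on earlier data only.
fill_value = median(v for _, v in earlier if v is not None)


def impute(rows):
    """Fill missing values with the median learned from the earlier period."""
    return [(d, fill_value if v is None else v) for d, v in rows]


def missing_rate(rows):
    """Share of rows where the value was not measured."""
    return sum(v is None for _, v in rows) / len(rows)


later_imputed = impute(later)
print(fill_value, missing_rate(earlier), missing_rate(later))
```

A jump in the missingness rate between periods, as in this toy data, would be a prompt to ask whether ordering habits or documentation practices changed.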
Healthcare tip: if a variable is missing because clinicians choose when to measure it, the missingness may itself reflect disease severity or workflow.
What a safe preprocessing pipeline looks like
1. Inspect the raw data
Count missing values, check units, identify categorical and numeric variables, and note repeated patients or sites.
2. Split first
Create train, validation, and test sets before learning any means, medians, encodings, or selected features.
3. Fit preprocessing on training data
Learn imputation values, category mappings, and scaling parameters using the training split only.
4. Apply the same recipe elsewhere
Transform validation and test data with the training-fitted recipe, never with their own statistics.
5. Wrap it into CV and tuning
For cross-validation, the whole preprocessing pipeline must be refit inside each fold, not once on the full dataset.
6. Document the assumptions
Report how missingness was handled, which variables were encoded or scaled, and what the leakage risks were.
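The steps above can be sketched end to end with scikit-learn's `Pipeline` and `ColumnTransformer`. This is an illustration under stated assumptions, not a recommended model: the data are synthetic, the column names are invented, and logistic regression is just a placeholder classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, hypothetical data purely for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(60, 12, n),
    "creatinine": rng.normal(1.1, 0.3, n),
    "ward": pd.Series(rng.choice(["medical", "surgical", "icu"], n), dtype=object),
})
y = (df["creatinine"] + rng.normal(0, 0.3, n) > 1.1).astype(int)
df.loc[rng.random(n) < 0.2, "creatinine"] = np.nan  # remove about 20% of values
df.loc[rng.random(n) < 0.1, "ward"] = np.nan        # remove about 10% of values

# 1. Inspect: share of missing values per column.
missing_share = df.isna().mean()

# 2. Split first, before anything is learned from the data.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# 3-4. Preprocessing lives inside the pipeline, so it is fitted on training
# data and then reused unchanged on any other split.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), ["age", "creatinine"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["ward"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5. cross_val_score clones and refits the whole pipeline inside each fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

# Fit on all training data, then evaluate on the test set once.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(missing_share.round(2).to_dict())
print(cv_scores.round(2), round(test_accuracy, 2))
```

Because the imputers, scaler, and encoder are steps of `model`, passing `model` to a tuning tool such as `GridSearchCV` would keep the same train-only behaviour in every fold.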
Common mistakes that quietly damage results
Imputing before the split
If the median or mode is computed on the full dataset, the future test data have already influenced the training pipeline.
Dropping all incomplete rows automatically
This can shrink the sample, remove sicker patients, and shift the target distribution.
Ignoring missingness as a signal
In clinical practice, “not measured” can sometimes mean “not suspected,” which may still contain predictive information.
Forgetting deployment reality
A beautiful preprocessing recipe is useless if it depends on variables not available when the prediction must be made.
Leakage warning: any preprocessing step that learns from the data must live inside the same train-only logic as the model itself.
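The first two mistakes can be seen in a few lines of plain Python. The rows below are invented: each is a lactate value (or `None` when not measured) paired with a 0/1 outcome. The sketch compares a fill value computed on all rows with one computed on training rows only, and shows how the event rate shifts when incomplete rows are dropped.

```python
from statistics import median

# Hypothetical rows: (lactate in mmol/L or None, died in hospital 0/1).
train = [(1.0, 0), (1.2, 0), (None, 0), (1.4, 0), (None, 1), (3.0, 1)]
test = [(4.0, 1), (5.0, 1), (None, 0)]


def observed(rows):
    """Recorded lactate values, ignoring rows where it was not measured."""
    return [value for value, _ in rows if value is not None]


def event_rate(rows):
    """Share of rows with outcome 1."""
    return sum(outcome for _, outcome in rows) / len(rows)


# Mistake: median computed on all rows, so test values shape the fill value.
leaky_fill = median(observed(train + test))
# Safer: median computed on the training rows only.
train_only_fill = median(observed(train))

# Mistake: dropping incomplete rows changes who remains in the sample.
complete_cases = [row for row in train if row[0] is not None]

print(f"fill value: leaky={leaky_fill:.2f}, train-only={train_only_fill:.2f}")
print(f"event rate: all train={event_rate(train):.2f}, complete cases={event_rate(complete_cases):.2f}")
```

In this toy data the test patients have high lactate, so the leaky fill value is pulled upward, and dropping incomplete rows lowers the event rate in the training sample.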
Quick self-check
Use these short cases to test whether the preprocessing logic is safe and realistic.
Case 1
You fill all missing lab values using the overall dataset median before creating train and test sets. Safe?
Correct answer: No. The imputation value must be learned from the training data only, then applied to the other splits.
Case 2
A lab test is often missing because clinicians order it only for severe patients. Could a missingness indicator be useful?
Correct answer: Yes. Missingness itself may reflect clinical decision-making and can carry signal if handled carefully.
Case 3
You run one preprocessing recipe on the full dataset, then perform k-fold cross-validation on the transformed data. Is that ideal?
Correct answer: No. The preprocessing must be refit inside each fold, otherwise the validation folds influence the fitted transformation.
How this connects to other portal pages
Before data splitting and leakage
Splitting tells you when preprocessing is allowed to learn from the data and when it is not.
Before hyperparameter tuning
Model tuning is only meaningful after the preprocessing pipeline is stable and leakage-safe.
Before evaluation metrics
Strong metrics can be misleading if the inputs were cleaned in a way that used future information.