
Missing Data, Imputation, and Safe ML Pipelines

Students often jump from raw spreadsheets straight into model training. In real health datasets, that step is dangerous. This page explains how to handle missing values, choose sensible preprocessing steps, and build pipelines that do not leak information across train, validation, and test data.

Beginner: Learn what missing data are, why simple cleaning can bias results, and why preprocessing order matters.
Intermediate: Compare imputation, encoding, scaling, and indicator features inside a proper train-only pipeline.
Advanced: Think about MNAR data, repeated patients, temporal drift, and how preprocessing must sit inside cross-validation.
5 core pipeline stages from raw data to model input
3 main missingness ideas: MCAR, MAR, MNAR
1 golden rule: fit preprocessing on training data only

Why this page matters

The portal already covers algorithms, evaluation, cross-validation, and hyperparameter tuning. The missing bridge is the practical preparation step that determines whether your model is learning from real signal or from careless data handling. In healthcare, missingness itself can carry meaning, so preprocessing is not just housekeeping.

Inspect data
Split safely
Fit preprocessing
Train and tune
Test once
Simple rule: split first, preprocess second, evaluate last.

Understand missing data before you “fix” it

Beginner

MCAR

Missing completely at random means the probability that a value is missing is unrelated to any patient data, observed or unobserved, including the outcome.

  • Rare in real clinical datasets.
  • Simple methods such as dropping incomplete rows are less risky here, because the remaining rows are still a representative sample.
  • Still worth checking how much data are missing.

Intermediate

MAR

Missing at random means that, once you account for observed information such as age, ward, or site, missingness no longer depends on the missing value itself.

  • Common in hospital datasets.
  • Imputation can use other observed predictors.
  • Missingness indicators may help.

Advanced

MNAR

Missing not at random means the missingness depends on the unobserved value itself or a hidden process.

  • Example: a lab test is ordered only when disease is suspected, and that suspicion is not recorded anywhere in the dataset.
  • Simple imputation may distort relationships.
  • Sensitivity analysis becomes important.
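
To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the synthetic `age` and `lab` variables, the 30% MCAR rate, and the logistic missingness curves are not taken from any real dataset. It compares the mean of the observed lab values with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic cohort (illustrative only): age is always observed,
# and the lab value rises slightly with age.
age = rng.normal(60, 12, n)
lab = 1.0 + 0.02 * (age - 60) + rng.normal(0, 0.3, n)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.30

# MAR: missingness depends only on the observed variable (age).
# Younger patients are more likely to have the lab missing.
p_mar = 1 / (1 + np.exp((age - 60) / 6))
mar_missing = rng.random(n) < p_mar

# MNAR: missingness depends on the unobserved lab value itself.
# Low values are more likely to go unrecorded.
p_mnar = 1 / (1 + np.exp((lab - 1.0) / 0.15))
mnar_missing = rng.random(n) < p_mnar

true_mean = lab.mean()
observed_means = {
    "MCAR": lab[~mcar_missing].mean(),
    "MAR": lab[~mar_missing].mean(),
    "MNAR": lab[~mnar_missing].mean(),
}
for name, observed in observed_means.items():
    print(f"{name}: observed mean {observed:.3f} vs true mean {true_mean:.3f}")
```

In this toy setup the MCAR mean stays close to the truth, while the MAR and MNAR means drift upward. The MAR drift could in principle be corrected using age, because age is observed. The MNAR drift cannot be corrected from the observed data alone, which is why sensitivity analysis matters there.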

Choose preprocessing by data situation

The best choice depends on the variable type, the study design, and what information would be available at prediction time.

Numeric variables with missing values

Think of blood pressure, creatinine, BMI, or other continuous measures.

  • Median imputation is a robust baseline.
  • Scaling can help models like logistic regression, SVM, and neural networks.
  • Add a missingness flag when absence may be informative.

Healthcare tip: if a variable is missing because clinicians choose when to measure it, the missingness may itself reflect disease severity or workflow.
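
The three bullets above can be combined in a few lines of scikit-learn. This is a minimal sketch, not a recommended recipe: the toy matrix and the choice of blood pressure and creatinine as columns are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training matrix (illustrative values): systolic BP and creatinine.
X_train = np.array([
    [120.0, 0.9],
    [135.0, np.nan],
    [np.nan, 1.4],
    [150.0, 1.1],
    [110.0, np.nan],
])

numeric_prep = Pipeline([
    # Median imputation. add_indicator=True appends one 0/1 column for each
    # feature that had missing values when the imputer was fitted.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # In this simple chain the scaler also standardizes the indicator columns.
    ("scale", StandardScaler()),
])

X_train_prepared = numeric_prep.fit_transform(X_train)
print(X_train_prepared.shape)  # 2 original columns + 2 missingness flags
print(numeric_prep.named_steps["impute"].statistics_)  # learned medians
```

If the flags should stay as raw 0/1 values, one option is to apply the scaler only to the original columns, for example with a `ColumnTransformer`.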

What a safe preprocessing pipeline looks like

1. Inspect the raw data

Count missing values, check units, identify categorical and numeric variables, and note repeated patients or sites.
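
A first inspection pass might look like the sketch below. The data frame and its column names (`patient_id`, `site`, `creatinine`, `bmi`) are hypothetical stand-ins for whatever your extract contains.

```python
import numpy as np
import pandas as pd

# Hypothetical extract: column names and values are illustrative only.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 4, 4],
    "site": ["A", "A", "B", "B", "A", "A"],
    "creatinine": [0.9, np.nan, 1.4, np.nan, 1.1, 1.0],
    "bmi": [24.0, 24.5, np.nan, 31.0, 28.0, 27.5],
})

# One row per column: type, count of missing values, percent missing.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
})
print(summary)

# Repeated patients: more than one row for the same patient_id.
rows_per_patient = df["patient_id"].value_counts()
repeated_patients = sorted(rows_per_patient[rows_per_patient > 1].index)
print("Patients with repeated rows:", repeated_patients)
```

Only counting happens here. Nothing is learned that later feeds the model, so this step is safe to run before splitting.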

2. Split first

Create train, validation, and test sets before learning any means, medians, encodings, or selected features.
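
When the same patient appears in several rows, a plain random split can put that patient on both sides. The sketch below shows one possible way to build train, validation, and test sets at patient level using scikit-learn's `GroupShuffleSplit`. The synthetic data, the `patient_id` array, the helper `group_split`, and the split fractions are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 200
# Hypothetical setup: about 60 patients, many with repeated visits.
patient_id = rng.integers(0, 60, size=n_rows)
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)


def group_split(indices, groups, test_size, seed):
    """Split row indices so that no group appears on both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    first, second = next(splitter.split(indices, groups=groups[indices]))
    return indices[first], indices[second]


all_idx = np.arange(n_rows)
# Hold out about 20% of patients for testing, then 25% of the remaining
# patients for validation.
train_val_idx, test_idx = group_split(all_idx, patient_id, test_size=0.2, seed=0)
train_idx, val_idx = group_split(train_val_idx, patient_id, test_size=0.25, seed=1)

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(len(train_idx), len(val_idx), len(test_idx))
```

With one row per patient and no site clustering, a stratified `train_test_split` would be a simpler alternative.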

3. Fit preprocessing on training data

Learn imputation values, category mappings, and scaling parameters using the training split only.

4. Apply the same recipe elsewhere

Transform validation and test data with the training-fitted recipe, never with their own statistics.
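
The fit-on-train, transform-elsewhere pattern is small enough to show with a deliberately tiny example. The numbers are invented so that the leak is easy to see: the test split contains one extreme value that would shift the median if it were included in fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                       # learns the median from training rows only
X_test_filled = imputer.transform(X_test)  # reuses the training median

print("Training median:", imputer.statistics_[0])
print("Test split after imputation:", X_test_filled.ravel())

# For contrast, the leaky version: a median computed over train and test together.
leaky_median = np.nanmedian(np.vstack([X_train, X_test]))
print("Median if test rows leak in:", leaky_median)
```

The safe imputer fills the missing test value with the training median of 2.0. The leaky median is 2.5, because the test value of 100 has pulled it upward.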

5. Wrap it into CV and tuning

For cross-validation, the whole preprocessing pipeline must be refit inside each fold, not once on the full dataset.
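
In scikit-learn, one common way to get this behavior is to put the preprocessing and the model into a single `Pipeline` and pass that object to the cross-validation or tuning routine, which then refits every step on each training fold. The sketch below uses synthetic data with 15% of values removed at random. The model choice and the parameter grid are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with values knocked out at random (illustrative only).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.15] = np.nan

model = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold refits the imputer and scaler on that fold's training rows only.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Mean AUC across folds:", scores.mean())

# Tuning works the same way: preprocessing choices become hyperparameters.
param_grid = {"impute__strategy": ["mean", "median"], "clf__C": [0.1, 1.0]}
search = GridSearchCV(model, param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)
print("Best settings:", search.best_params_)
```

With repeated patients, a group-aware splitter such as `GroupKFold` would take the place of `StratifiedKFold`.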

6. Document the assumptions

Report how missingness was handled, which variables were encoded or scaled, and what the leakage risks were.

Common mistakes that quietly damage results

Imputing before the split

If the median or mode is computed on the full dataset, the test data have already influenced the training pipeline, and the test score no longer reflects truly unseen data.

Dropping all incomplete rows automatically

This can shrink the sample, remove sicker patients, and shift the target distribution.
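
A quick check before and after dropping rows makes this visible. The sketch below is a toy simulation built on an explicit assumption: records for severe patients are incomplete 50% of the time, against 10% for everyone else. The `bmi` and `severe` columns are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000
severe = rng.random(n) < 0.30  # synthetic outcome flag, about 30% prevalence

# Assumption for this toy: sicker patients have more incomplete records.
p_missing = np.where(severe, 0.50, 0.10)
bmi = rng.normal(27, 4, n)
bmi[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({"bmi": bmi, "severe": severe})
complete = df.dropna()

print("Rows before and after:", len(df), len(complete))
print("Severe rate, all rows:", round(df["severe"].mean(), 3))
print("Severe rate, complete rows only:", round(complete["severe"].mean(), 3))
```

Under these assumptions the complete-case subset is both smaller and noticeably less severe than the original cohort. Comparing the outcome rate before and after any row filter is a cheap habit worth keeping.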

Ignoring missingness as a signal

In clinical practice, “not measured” can sometimes mean “not suspected,” which may still contain predictive information.
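
One simple way to see whether missingness carries signal is to compare outcome rates between rows where the value is present and rows where it is absent. The simulation below is a sketch under invented assumptions: a hypothetical `troponin` test ordered for 90% of suspected cases and 15% of others, with disease far more common among suspected cases.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 4_000
suspected = rng.random(n) < 0.25

# Assumptions for this toy: ordering and disease both depend on suspicion.
ordered = rng.random(n) < np.where(suspected, 0.90, 0.15)
disease = rng.random(n) < np.where(suspected, 0.40, 0.03)
troponin = np.where(ordered, rng.lognormal(0, 1, n), np.nan)

df = pd.DataFrame({"troponin": troponin, "disease": disease})
df["troponin_missing"] = df["troponin"].isna().astype(int)

# Disease rate among measured rows (flag 0) and unmeasured rows (flag 1).
rate_by_flag = df.groupby("troponin_missing")["disease"].mean()
print(rate_by_flag)
```

Here the flag alone separates a higher-risk group from a lower-risk one, even though the simulated troponin values themselves are pure noise. Whether such a flag is appropriate in a real model depends on whether the same ordering behavior will hold at deployment.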

Forgetting deployment reality

A beautiful preprocessing recipe is useless if it depends on variables not available when the prediction must be made.

Leakage warning: any preprocessing step that learns from the data must live inside the same train-only logic as the model itself.

Quick self-check

Use these short cases to test whether the preprocessing logic is safe and realistic.

Case 1

You fill all missing lab values using the overall dataset median before creating train and test sets. Safe?

Correct answer: No. The imputation value must be learned from the training data only, then applied to the other splits.

Case 2

A lab test is often missing because clinicians order it only for severe patients. Could a missingness indicator be useful?

Correct answer: Yes. Missingness itself may reflect clinical decision-making and can carry signal if handled carefully.

Case 3

You run one preprocessing recipe on the full dataset, then perform k-fold cross-validation on the transformed data. Is that ideal?

Correct answer: No. The preprocessing must be refit inside each fold, otherwise the validation folds influence the fitted transformation.

How this connects to other portal pages

Before data splitting and leakage

Splitting tells you when preprocessing is allowed to learn from the data and when it is not.

Before hyperparameter tuning

Model tuning is only meaningful after the preprocessing pipeline is stable and leakage-safe.

Before evaluation metrics

Strong metrics can be misleading if the inputs were cleaned in a way that used future information.