🧠 ML Hyperparameters Guide

Master the essential hyperparameters for machine learning models in healthcare applications. This comprehensive guide covers the most important parameters that control model behavior, performance, and generalization.

6 Model Types • 25+ Hyperparameters • Healthcare Focused

🏥 Healthcare Context

In healthcare ML applications, hyperparameter tuning is crucial for model reliability and interpretability. Proper regularization helps prevent overfitting on limited patient data, while balanced complexity ensures models generalize well across diverse populations and clinical settings. Feature selection and engineering are particularly critical given the high-dimensional nature of medical data (genomics, imaging, EHRs).

FE • Feature Engineering & Selection (Preprocessing • Critical)
n_features
Number of features to select from available dataset
In healthcare: Balance between information richness and model interpretability. Too many features can lead to overfitting with limited patient samples
feature_selection_method
Method for selecting features (univariate, RFE, LASSO-based, mutual information)
Critical for genomics data with 20K+ genes, or EHR data with thousands of variables. Method affects model interpretability and performance
variance_threshold
Minimum variance required to keep a feature
Removes near-constant features common in medical data (e.g., rare conditions that are almost always absent)
correlation_threshold
Maximum correlation allowed between features
Removes redundant biomarkers or correlated lab values that don't add predictive value
feature_engineering_depth
Level of feature engineering (polynomial, interactions, domain-specific)
Medical ratios (e.g., LDL/HDL), age-adjusted values, and clinical scores often outperform raw measurements
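The selection parameters above map directly onto scikit-learn's feature-selection tools. A minimal sketch, using synthetic data in place of real clinical features; the threshold of 0.0 and k=10 are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a high-dimensional clinical dataset
X, y = make_classification(n_samples=200, n_features=50, n_informative=8, random_state=0)

selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),    # drop constant features
    ("kbest", SelectKBest(mutual_info_classif, k=10)), # n_features to keep
])
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (200, 10)
```

Chaining the variance filter before the scored selector mirrors the order described above: cheap removal of uninformative features first, then a method-driven ranking (here mutual information) down to the target feature count.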
LR • Linear Regression (Regularized • Regression)
alpha
Strength of regularization (penalty on large coefficients)
Prevents overfitting and shrinks irrelevant features, especially critical in high-dimensional medical datasets with many biomarkers
penalty
Type of regularization: L1 (Lasso), L2 (Ridge), or ElasticNet
L1 performs automatic feature selection (eliminates irrelevant genes/markers); L2 shrinks weights smoothly for stability
l1_ratio
ElasticNet mixing parameter (balance between L1 and L2)
Combines feature selection (L1) with stability (L2), ideal for high-dimensional medical data with correlated features
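In scikit-learn, alpha and l1_ratio are set on the ElasticNet estimator itself (Lasso and Ridge are the pure-L1 and pure-L2 special cases). A sketch with synthetic regression data; the alpha value is an illustrative assumption and would normally be chosen by cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Many features, few truly informative, as in biomarker panels
X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# alpha = regularization strength; l1_ratio mixes L1 (sparsity) with L2 (stability)
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0.0))
print(n_zero)  # L1 component drives irrelevant coefficients exactly to zero
```

The zeroed coefficients are the "automatic feature selection" behavior noted above: features the model judges irrelevant drop out of the fitted equation entirely.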
LG • Logistic Regression (Classification • Probabilistic)
C
Inverse of regularization strength (smaller C = stronger penalty)
Essential for balancing bias-variance tradeoff, especially with imbalanced medical datasets (rare diseases)
solver
Optimization algorithm (liblinear, saga, lbfgs)
Choice affects convergence speed and memory usage with large patient databases
class_weight
Weights for different classes ('balanced', custom weights)
Critical for imbalanced medical datasets where diseases are rare (e.g., 1% cancer cases)
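The three parameters combine naturally in one estimator. A sketch on a synthetic imbalanced dataset (the 95/5 split mimics a rare-disease setting; all values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positives, mimicking a rare condition
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

clf = LogisticRegression(C=1.0, solver="lbfgs",
                         class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Training-set sensitivity on the rare class (illustrative only;
# real evaluation needs a held-out set)
recall = clf.score(X[y == 1], y[y == 1])
print(recall)
```

Without class_weight="balanced", the same model can reach high accuracy by predicting "healthy" for nearly everyone; the reweighting forces it to pay attention to the 5% positive cases.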
NB • Naive Bayes (Probabilistic • Fast)
var_smoothing
Fraction of the largest feature variance added to all feature variances for numerical stability
Stabilizes calculations when medical features have near-zero variance, common in binary diagnostic indicators
priors
Prior probabilities of the classes; estimated from the training data if not specified
Lets you encode known disease prevalence directly rather than relying on sample frequencies from a potentially unrepresentative cohort
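Both parameters live on GaussianNB. A sketch with synthetic data; the 90/10 prior is an illustrative assumption standing in for a known population prevalence:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# var_smoothing stabilizes per-feature variances;
# priors encodes assumed class prevalence instead of sample frequencies
clf = GaussianNB(var_smoothing=1e-9, priors=[0.9, 0.1]).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.shape)  # (5, 2), rows sum to 1
```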
DT • Decision Tree (Interpretable • Rule-based)
max_depth
Maximum depth of the tree structure
Limits model complexity and prevents overfitting, crucial for maintaining interpretability in clinical decision-making
min_samples_split
Minimum samples required to split an internal node
Ensures statistically meaningful splits, important when working with limited patient cohorts
criterion
Split quality measure: gini impurity or entropy
Determines how the algorithm chooses the best feature to split on at each decision point
max_features
Number of features considered for each split
Controls model variance and can improve generalization across different patient populations. 'sqrt' often works well for medical classification
class_weight
Weights for different classes to handle imbalanced data
Essential for rare disease detection where positive cases are much less frequent than negative cases
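A sketch combining the five parameters above in one estimator; the specific values (depth 4, 20-sample splits) are illustrative assumptions, not clinical recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,              # shallow tree stays readable for clinicians
    min_samples_split=20,     # require statistically meaningful splits
    criterion="gini",         # split quality measure
    max_features="sqrt",      # consider a feature subset at each split
    class_weight="balanced",  # compensate for the rare positive class
    random_state=0,
).fit(X, y)
print(clf.get_depth())  # never exceeds max_depth
```

With max_depth=4, the fitted tree can be printed as at most four nested if/else levels, which is what makes the interpretability claim above concrete.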
RF • Random Forest (Ensemble • Robust)
n_estimators
Number of decision trees in the forest
More trees improve stability and reduce variance, critical for robust medical predictions (typically 100-500 trees)
max_depth
Maximum depth for each tree in the ensemble
Controls individual tree complexity, preventing overfitting while maintaining ensemble diversity
max_features
Number of features randomly selected at each split
Increases tree diversity and reduces correlation between trees, improving generalization
bootstrap
Whether to sample with replacement for each tree
Introduces diversity and robustness, essential for handling variability in patient data across different hospitals/populations
class_weight
Weights for different classes ('balanced', 'balanced_subsample')
Critical for medical datasets with class imbalance, helps prevent bias toward majority class (healthy patients)
oob_score
Whether to use out-of-bag samples for validation
Provides unbiased estimate of model performance without separate validation set, useful with limited medical data
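All six parameters appear in RandomForestClassifier. A sketch with synthetic imbalanced data; note that oob_score requires bootstrap=True, since out-of-bag samples only exist when each tree sees a bootstrap resample:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,                     # within the typical 100-500 range
    max_depth=6,                          # cap individual tree complexity
    max_features="sqrt",                  # decorrelate the trees
    bootstrap=True,                       # sample with replacement per tree
    class_weight="balanced_subsample",    # reweight within each bootstrap
    oob_score=True,                       # free validation estimate
    random_state=0,
).fit(X, y)
print(round(clf.oob_score_, 3))
```

The oob_score_ attribute gives the "validation without a validation set" estimate described above, which is valuable when every labeled patient record is needed for training.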
SVM • Support Vector Machine (Kernel-based • Versatile)
C
Regularization parameter (controls margin vs. training error)
Higher C yields a more complex decision boundary (low bias, high variance); lower C yields a smoother one. Critical for noisy medical data
kernel
Kernel function: linear, polynomial, RBF (Gaussian), sigmoid
Determines decision boundary shape. RBF works well for non-linear medical patterns; linear is often preferred for high-dimensional genomic data
gamma
Kernel coefficient for RBF, polynomial, and sigmoid kernels
Controls influence of single training examples. High gamma = tight fit (overfitting risk), low gamma = smooth fit
degree
Degree of polynomial kernel function
Higher degrees can capture complex feature interactions in medical data but increase computational cost and overfitting risk
class_weight
Weights for different classes ('balanced' or custom weights)
Essential for imbalanced medical datasets. Helps SVM not bias toward majority class (healthy vs. diseased)
probability
Whether to enable probability estimates
Important for medical applications where you need confidence scores, not just classifications (risk assessment)
tol
Tolerance for stopping criterion
Affects convergence speed vs. precision trade-off, important for large medical datasets
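A sketch wiring the SVM parameters above into scikit-learn's SVC; the values are illustrative assumptions, and probability=True triggers an internal cross-validated calibration step that noticeably slows training on large datasets:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

clf = SVC(
    C=1.0,                    # margin vs. training-error trade-off
    kernel="rbf",             # non-linear decision boundary
    gamma="scale",            # kernel coefficient set from feature variance
    class_weight="balanced",  # counteract the imbalance
    probability=True,         # enable risk scores, not just labels
    tol=1e-3,                 # stopping tolerance
    random_state=0,
).fit(X, y)
proba = clf.predict_proba(X[:3])
print(proba.shape)  # (3, 2): per-class probability for risk assessment
```

The degree parameter is omitted here because it only applies to kernel="poly"; with the RBF kernel it is ignored.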

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin
