🧠 ML Hyperparameters Guide

Master the essential hyperparameters for machine learning models in healthcare applications. This comprehensive guide covers the most important parameters that control model behavior, performance, and generalization.

6 Model Types • 25+ Hyperparameters • Healthcare Focused

🏥 Healthcare Context

In healthcare ML applications, hyperparameter tuning is crucial for model reliability and interpretability. Proper regularization helps prevent overfitting on limited patient data, while balanced complexity ensures models generalize well across diverse populations and clinical settings. Feature selection and engineering are particularly critical given the high-dimensional nature of medical data (genomics, imaging, EHRs).

FE • Feature Engineering & Selection (Preprocessing • Critical)
n_features
Number of features to select from available dataset
In healthcare: Balance between information richness and model interpretability. Too many features can lead to overfitting with limited patient samples
feature_selection_method
Method for selecting features (univariate, RFE, LASSO-based, mutual information)
Critical for genomics data with 20K+ genes, or EHR data with thousands of variables. Method affects model interpretability and performance
variance_threshold
Minimum variance required to keep a feature
Removes near-constant features common in medical data (e.g., rare conditions that are almost always absent)
correlation_threshold
Maximum correlation allowed between features
Removes redundant biomarkers or correlated lab values that don't add predictive value
feature_engineering_depth
Level of feature engineering (polynomial, interactions, domain-specific)
Medical ratios (e.g., LDL/HDL), age-adjusted values, and clinical scores often outperform raw measurements
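The selection parameters above map directly onto scikit-learn's feature-selection tools. A minimal sketch, using synthetic data in place of real clinical features; the threshold of 0.0 and k=10 are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a high-dimensional clinical dataset
X, y = make_classification(n_samples=200, n_features=50, n_informative=8, random_state=0)

selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),    # drop constant features
    ("kbest", SelectKBest(mutual_info_classif, k=10)), # n_features to keep
])
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (200, 10)
```

Chaining the variance filter before the scored selector mirrors the order described above: cheap removal of uninformative features first, then a method-driven ranking (here mutual information) down to the target feature count.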
LR • Linear Regression (Regularized • Regression)
alpha
Strength of regularization (penalty on large coefficients)
Prevents overfitting and shrinks irrelevant features, especially critical in high-dimensional medical datasets with many biomarkers
penalty
Type of regularization: L1 (Lasso), L2 (Ridge), or ElasticNet
L1 performs automatic feature selection (eliminates irrelevant genes/markers); L2 shrinks weights smoothly for stability
l1_ratio
ElasticNet mixing parameter (balance between L1 and L2)
Combines feature selection (L1) with stability (L2), ideal for high-dimensional medical data with correlated features
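In scikit-learn, alpha and l1_ratio are set on the ElasticNet estimator itself (Lasso and Ridge are the pure-L1 and pure-L2 special cases). A sketch with synthetic regression data; the alpha value is an illustrative assumption and would normally be chosen by cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Many features, few truly informative, as in biomarker panels
X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# alpha = regularization strength; l1_ratio mixes L1 (sparsity) with L2 (stability)
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0.0))
print(n_zero)  # L1 component drives irrelevant coefficients exactly to zero
```

The zeroed coefficients are the "automatic feature selection" behavior noted above: features the model judges irrelevant drop out of the fitted equation entirely.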
LG • Logistic Regression (Classification • Probabilistic)
C
Inverse of regularization strength (smaller C = stronger penalty)
Essential for balancing bias-variance tradeoff, especially with imbalanced medical datasets (rare diseases)
solver
Optimization algorithm (liblinear, saga, lbfgs)
Choice affects convergence speed and memory usage with large patient databases
class_weight
Weights for different classes ('balanced', custom weights)
Critical for imbalanced medical datasets where diseases are rare (e.g., 1% cancer cases)
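The three parameters combine naturally in one estimator. A sketch on a synthetic imbalanced dataset (the 95/5 split mimics a rare-disease setting; all values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~5% positives, mimicking a rare condition
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

clf = LogisticRegression(C=1.0, solver="lbfgs",
                         class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Training-set sensitivity on the rare class (illustrative only;
# real evaluation needs a held-out set)
recall = clf.score(X[y == 1], y[y == 1])
print(recall)
```

Without class_weight="balanced", the same model can reach high accuracy by predicting "healthy" for nearly everyone; the reweighting forces it to pay attention to the 5% positive cases.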
NB • Naive Bayes (Probabilistic • Fast)
var_smoothing
Fraction of the largest feature variance added to all feature variances for numerical stability
Stabilizes calculations when medical features have near-zero variance, common in binary diagnostic indicators
priors
Prior probabilities of the classes; estimated from the training data if not specified
Lets you encode known disease prevalence directly rather than relying on sample frequencies from a potentially unrepresentative cohort
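Both parameters live on GaussianNB. A sketch with synthetic data; the 90/10 prior is an illustrative assumption standing in for a known population prevalence:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# var_smoothing stabilizes per-feature variances;
# priors encodes assumed class prevalence instead of sample frequencies
clf = GaussianNB(var_smoothing=1e-9, priors=[0.9, 0.1]).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.shape)  # (5, 2), rows sum to 1
```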
DT • Decision Tree (Interpretable • Rule-based)
max_depth
Maximum depth of the tree structure
Limits model complexity and prevents overfitting, crucial for maintaining interpretability in clinical decision-making
min_samples_split
Minimum samples required to split an internal node
Ensures statistically meaningful splits, important when working with limited patient cohorts
criterion
Split quality measure: gini impurity or entropy
Determines how the algorithm chooses the best feature to split on at each decision point
max_features
Number of features considered for each split
Controls model variance and can improve generalization across different patient populations. 'sqrt' often works well for medical classification
class_weight
Weights for different classes to handle imbalanced data
Essential for rare disease detection where positive cases are much less frequent than negative cases
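A sketch combining the five parameters above in one estimator; the specific values (depth 4, 20-sample splits) are illustrative assumptions, not clinical recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15,
                           weights=[0.9, 0.1], random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,              # shallow tree stays readable for clinicians
    min_samples_split=20,     # require statistically meaningful splits
    criterion="gini",         # split quality measure
    max_features="sqrt",      # consider a feature subset at each split
    class_weight="balanced",  # compensate for the rare positive class
    random_state=0,
).fit(X, y)
print(clf.get_depth())  # never exceeds max_depth
```

With max_depth=4, the fitted tree can be printed as at most four nested if/else levels, which is what makes the interpretability claim above concrete.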
RF • Random Forest (Ensemble • Robust)
n_estimators
Number of decision trees in the forest
More trees improve stability and reduce variance, critical for robust medical predictions (typically 100-500 trees)
max_depth
Maximum depth for each tree in the ensemble
Controls individual tree complexity, preventing overfitting while maintaining ensemble diversity
max_features
Number of features randomly selected at each split
Increases tree diversity and reduces correlation between trees, improving generalization
bootstrap
Whether to sample with replacement for each tree
Introduces diversity and robustness, essential for handling variability in patient data across different hospitals/populations
class_weight
Weights for different classes ('balanced', 'balanced_subsample')
Critical for medical datasets with class imbalance, helps prevent bias toward majority class (healthy patients)
oob_score
Whether to use out-of-bag samples for validation
Provides unbiased estimate of model performance without separate validation set, useful with limited medical data
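All six parameters appear in RandomForestClassifier. A sketch with synthetic imbalanced data; note that oob_score requires bootstrap=True, since out-of-bag samples only exist when each tree sees a bootstrap resample:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,                     # within the typical 100-500 range
    max_depth=6,                          # cap individual tree complexity
    max_features="sqrt",                  # decorrelate the trees
    bootstrap=True,                       # sample with replacement per tree
    class_weight="balanced_subsample",    # reweight within each bootstrap
    oob_score=True,                       # free validation estimate
    random_state=0,
).fit(X, y)
print(round(clf.oob_score_, 3))
```

The oob_score_ attribute gives the "validation without a validation set" estimate described above, which is valuable when every labeled patient record is needed for training.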
SVM • Support Vector Machine (Kernel-based • Versatile)
C
Regularization parameter (controls margin vs. training error)
Higher C yields a more complex decision boundary (low bias, high variance); lower C yields a smoother one. Critical for noisy medical data
kernel
Kernel function: linear, polynomial, RBF (Gaussian), sigmoid
Determines decision boundary shape. RBF works well for non-linear medical patterns; linear is often preferred for high-dimensional genomic data
gamma
Kernel coefficient for RBF, polynomial, and sigmoid kernels
Controls influence of single training examples. High gamma = tight fit (overfitting risk), low gamma = smooth fit
degree
Degree of polynomial kernel function
Higher degrees can capture complex feature interactions in medical data but increase computational cost and overfitting risk
class_weight
Weights for different classes ('balanced' or custom weights)
Essential for imbalanced medical datasets. Helps SVM not bias toward majority class (healthy vs. diseased)
probability
Whether to enable probability estimates
Important for medical applications where you need confidence scores, not just classifications (risk assessment)
tol
Tolerance for stopping criterion
Affects convergence speed vs. precision trade-off, important for large medical datasets
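A sketch wiring the SVM parameters above into scikit-learn's SVC; the values are illustrative assumptions, and probability=True triggers an internal cross-validated calibration step that noticeably slows training on large datasets:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

clf = SVC(
    C=1.0,                    # margin vs. training-error trade-off
    kernel="rbf",             # non-linear decision boundary
    gamma="scale",            # kernel coefficient set from feature variance
    class_weight="balanced",  # counteract the imbalance
    probability=True,         # enable risk scores, not just labels
    tol=1e-3,                 # stopping tolerance
    random_state=0,
).fit(X, y)
proba = clf.predict_proba(X[:3])
print(proba.shape)  # (3, 2): per-class probability for risk assessment
```

The degree parameter is omitted here because it only applies to kernel="poly"; with the RBF kernel it is ignored.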

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin
