Neural Network Common Issues & Solutions

Understanding and Fixing Training Problems

🧠 Deep Learning Fundamentals

Common Neural Network Issues

🔴 Training Issues

  • Gradient Vanishing/Exploding
  • Overfitting
  • Underfitting
  • Slow Convergence

⚡ Optimization Issues

  • Poor Weight Initialization
  • Learning Rate Problems
  • Activation Function Issues
  • Batch Size Effects

Key Point: Most issues stem from improper gradient flow, poor regularization, or suboptimal hyperparameters.

Gradient Vanishing Problem

🔴 Problem

Gradients become exponentially smaller as they propagate backward through deep networks, because each layer multiplies them by its activation's derivative, which is at most 0.25 for sigmoid and below 1 for tanh away from zero.

Result: Early layers learn very slowly or stop learning entirely.
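
To confirm the diagnosis, inspect per-layer gradient norms after a backward pass. A minimal sketch, assuming a model, criterion, and a batch of inputs/targets from your own training loop:

# Inspect per-layer gradient norms after a backward pass
# (model, criterion, inputs, targets come from your own training setup)
model.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item():.2e}")
# Norms that shrink by orders of magnitude toward the early layers indicate vanishing gradients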

✅ Solutions

1. Better Activation Functions
# Replace sigmoid/tanh with ReLU variants
import torch.nn as nn

# Instead of: nn.Sigmoid() or nn.Tanh()
nn.ReLU()           # Standard ReLU
nn.LeakyReLU(0.01)  # Leaky ReLU
nn.ELU()            # Exponential Linear Unit
nn.GELU()           # Gaussian Error Linear Unit
2. Proper Weight Initialization
# Xavier/Glorot initialization
nn.init.xavier_uniform_(layer.weight)

# He initialization (for ReLU)
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
3. Residual Connections
# Skip connections allow gradients to flow directly
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Same-padding convolutions keep shapes compatible with the skip connection
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.relu(out)
        out = self.conv2(out)
        out += residual  # Skip connection
        return self.relu(out)

Gradient Exploding Problem

🔴 Problem

Gradients become exponentially larger during backpropagation, causing unstable training and NaN values.

Symptoms: Loss shoots to infinity, weights become NaN, training diverges.
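
It also helps to detect the blow-up as soon as it happens rather than after the loss has already become NaN. A minimal sketch, again assuming the usual training-loop names (model, criterion, inputs, targets):

# Detect exploding gradients and non-finite loss inside the training loop
import torch

loss = criterion(model(inputs), targets)
if not torch.isfinite(loss):
    raise RuntimeError(f"Non-finite loss: {loss.item()}")

loss.backward()
grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
total_norm = sum(g ** 2 for g in grad_norms) ** 0.5
print(f"Total gradient norm: {total_norm:.2f}")  # Rapid growth across steps signals explosion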

✅ Solutions

1. Gradient Clipping
# Clip gradients by norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Clip gradients by value
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

# In training loop:
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
2. Lower Learning Rates
# Start with smaller learning rates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Instead of 1e-2

# Use learning rate scheduling
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
3. Batch Normalization
# Normalize inputs to each layer
class NormalizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3)
        self.bn1 = nn.BatchNorm2d(64)  # Batch normalization
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)   # Normalize before activation
        x = self.relu(x)
        return x

Overfitting

🔴 Problem

Model memorizes training data but fails to generalize to new data.

Signs: Training accuracy >> Validation accuracy, large gap between train/val loss.
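
A simple way to catch this early is to log the gap between training and validation loss every epoch. A minimal sketch, assuming train_loss and val_loss are computed elsewhere in your loop:

# Track the generalization gap during training
history = {"train_loss": [], "val_loss": []}

for epoch in range(epochs):
    # ... compute train_loss and val_loss for this epoch ...
    history["train_loss"].append(train_loss)
    history["val_loss"].append(val_loss)
    print(f"epoch {epoch}: train={train_loss:.4f}  val={val_loss:.4f}  "
          f"gap={val_loss - train_loss:.4f}")
# A gap that keeps widening while training loss still falls points to overfitting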

✅ Solutions

1. Dropout
# Randomly set neurons to zero during training
class DropoutNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout1 = nn.Dropout(0.5)  # 50% dropout
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.dropout1(x)  # Only active during training
        x = self.fc2(x)
        return x
2. L1/L2 Regularization
# Add regularization to optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2

# Manual L1 regularization
def l1_regularization(model, lambda_l1):
    l1_norm = sum(p.abs().sum() for p in model.parameters())
    return lambda_l1 * l1_norm

# In loss calculation:
loss = criterion(outputs, targets) + l1_regularization(model, 1e-5)
3. Early Stopping
# Stop training when validation loss stops improving
best_val_loss = float('inf')
patience = 5
patience_counter = 0

for epoch in range(epochs):
    # ... training code ...
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping!")
            break

Underfitting

🔴 Problem

Model is too simple to capture underlying patterns in the data.

Signs: Both training and validation accuracy are low; the model has high bias.
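
A quick sanity check is to see whether the model can memorize a tiny subset of the training data; if it cannot, capacity (or a data/pipeline bug) is the problem. A minimal sketch, assuming a train_dataset, model, optimizer, and criterion already exist:

# Sanity check: a sufficiently expressive model should overfit ~64 samples almost perfectly
from torch.utils.data import DataLoader, Subset

tiny_set = Subset(train_dataset, range(64))          # first 64 training samples
tiny_loader = DataLoader(tiny_set, batch_size=16, shuffle=True)

for epoch in range(200):
    for inputs, targets in tiny_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
print(f"loss on tiny subset: {loss.item():.4f}")  # still high -> too little capacity (or a bug)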

✅ Solutions

1. Increase Model Complexity
# Add more layers or neurons
class LargerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),   # Increase hidden units
            nn.ReLU(),
            nn.Linear(512, 256),   # Add more layers
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.layers(x)
2. Reduce Regularization
# Lower dropout rates
nn.Dropout(0.2)  # Instead of 0.5

# Reduce weight decay
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6)  # Lower
3. Feature Engineering
# Add polynomial features or feature interactions
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Or use more sophisticated architectures (CNN for images, RNN for sequences)

Learning Rate Problems

🔴 Too High

  • Loss oscillates wildly
  • Training diverges
  • Overshoots minima

🔴 Too Low

  • Very slow convergence
  • Gets stuck in local minima
  • Training plateaus early

✅ Solutions

1. Learning Rate Scheduling
# Step decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine annealing
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Reduce on plateau
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)

# In training loop:
for epoch in range(epochs):
    # ... training code ...
    scheduler.step()  # or scheduler.step(val_loss) for ReduceLROnPlateau
2. Adaptive Optimizers
# Use optimizers with adaptive learning rates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Good default
# or
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# or
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
3. Learning Rate Finder
# Find optimal learning rate range
import matplotlib.pyplot as plt

def find_lr(model, train_loader, optimizer, criterion):
    lrs = []
    losses = []
    lr = 1e-8
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.param_groups[0]['lr'] = lr
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= 1.1  # Exponentially increase
        if lr > 1:
            break
    # Plot and find the steepest descent
    plt.plot(lrs, losses)
    plt.xscale('log')

Summary & Best Practices

🎯 Prevention Checklist

  • ✅ Use ReLU-family activations
  • ✅ Proper weight initialization
  • ✅ Batch normalization
  • ✅ Gradient clipping
  • ✅ Dropout for regularization
  • ✅ Early stopping
  • ✅ Learning rate scheduling
  • ✅ Monitor train/val metrics

🔧 Debugging Workflow

  1. Check data: Normalize inputs, verify labels (see the sketch after this list)
  2. Start simple: Small model first
  3. Monitor gradients: Use gradient norms
  4. Validate assumptions: Plot loss curves
  5. Iterative improvement: One change at a time
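
For step 1, a minimal data-check sketch; the feature tensor X and integer label tensor y are placeholders for your own data:

# Step 1: standardize inputs and sanity-check labels before training
import torch

X = X.float()
X = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)   # zero mean, unit variance per feature

assert torch.isfinite(X).all(), "inputs contain NaN/Inf"
print("label counts per class:", torch.bincount(y))  # spot missing or heavily imbalanced classes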

💡 Key Takeaway

Most neural network issues can be prevented with proper architecture design, initialization, and hyperparameter tuning. Always start with proven defaults and adjust based on your specific problem!

# Template for robust neural network
class RobustNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, dropout=0.3):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.BatchNorm1d(hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, num_classes)
        )
        # He initialization
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_uniform_(m.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        return self.network(x)

© 2025 Machine Learning for Health Research Course | Prof. Gennady Roshchupkin
