Introduction: What Are Activation Functions and Why They Matter
Activation functions are mathematical operations applied to each neuron's weighted sum of inputs to produce the neuron's output, determining whether, and how strongly, the neuron activates. They introduce non-linearity into neural networks, enabling them to learn complex patterns and relationships in data that simple linear models cannot capture. Without activation functions, a network of any depth would collapse into a single linear (affine) transformation of its inputs, as the short sketch below illustrates.
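A minimal NumPy sketch of that last point, using arbitrary toy dimensions: two stacked linear layers with no activation between them compute exactly the same function as a single linear layer.

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same computation collapses into a single linear layer y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: the extra layer adds no expressive power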
Core Concepts and Principles
- Neuron Activation: Transforms the weighted sum of inputs (z) into an output value (a)
- Non-linearity: Enables neural networks to model complex, non-linear relationships
- Differentiability: Most activation functions need to be differentiable for backpropagation
- Vanishing/Exploding Gradient Problem: Some activation functions can lead to gradients becoming too small or too large during training
- Sparsity: Certain activation functions promote sparse activations (many neurons output zero)
- Computational Efficiency: Some functions are more computationally expensive than others
Common Activation Functions
Sigmoid Function
Formula: σ(z) = 1 / (1 + e^(-z))
Range: (0, 1)
Properties:
- Smooth and differentiable everywhere
- Historically popular but rarely used as hidden layer activation in modern networks
- Suffers from the vanishing gradient problem for inputs far from zero, where the curve saturates (see the numeric check after this list)
- Outputs are not zero-centered
Use Cases:
- Output layer for binary classification
- Gates in LSTM and GRU units
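A small numeric check (NumPy, illustrative values only) makes the saturation concrete: the derivative σ(z)(1 − σ(z)) peaks at 0.25 and is nearly zero far from the origin, which is what starves gradients in deep stacks of sigmoid layers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(z)
grad = s * (1.0 - s)  # derivative of the sigmoid

print(s)     # approx [0.00005, 0.119, 0.5, 0.881, 0.99995]
print(grad)  # approx [0.00005, 0.105, 0.25, 0.105, 0.00005] -> vanishes far from 0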
Hyperbolic Tangent (tanh)
Formula: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Range: (-1, 1)
Properties:
- Zero-centered outputs
- Steeper gradients than sigmoid
- Still suffers from vanishing gradient problem
- Squashes inputs into the range (-1, 1)
Use Cases:
- Hidden layers in shallow networks
- Recurrent neural networks
- When zero-centered activations are preferred
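For intuition, tanh is simply a shifted and rescaled sigmoid, tanh(z) = 2σ(2z) − 1, which is why it shares the sigmoid's saturation behaviour while being zero-centered; a quick NumPy check of the identity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True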
Rectified Linear Unit (ReLU)
Formula: ReLU(z) = max(0, z)
Range: [0, ∞)
Properties:
- Computationally efficient
- Helps mitigate vanishing gradient problem
- Non-differentiable at z=0
- Can suffer from the “dying ReLU” problem, where neurons get stuck outputting 0 for every input (see the sketch after this list)
Use Cases:
- Default choice for hidden layers in CNNs
- Most feed-forward neural networks
- Deep networks
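The sketch below (NumPy, toy values, using the common convention of gradient 0 at z = 0) shows ReLU and its gradient side by side; the gradient is identically zero for negative inputs, which is the mechanism behind dying ReLU units.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention: treat the derivative at z = 0 as 0
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.] -> no gradient flows for negative inputs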
Leaky ReLU
Formula: Leaky_ReLU(z) = max(αz, z) where α is a small constant (e.g., 0.01)
Range: (-∞, ∞)
Properties:
- Allows small negative values when input is less than zero
- Helps prevent dying ReLU problem
- Preserves some gradient for negative inputs
Use Cases:
- Alternative to ReLU when neuron death is a concern
- Deep networks with potential for sparse gradients
Parametric ReLU (PReLU)
Formula: PReLU(z) = max(αz, z) where α is a learnable parameter
Range: (-∞, ∞)
Properties:
- Learnable slope for negative inputs
- Can adapt during training
- May overfit on small datasets
Use Cases:
- When you want the network to learn the optimal negative slope
- Deep networks with sufficient training data
Exponential Linear Unit (ELU)
Formula: ELU(z) = z if z > 0, α(e^z - 1) if z ≤ 0
Range: (-α, ∞)
Properties:
- Smooth, and differentiable everywhere when α = 1 (the usual default)
- Can produce negative outputs
- Helps mitigate vanishing gradient problem
- Computationally more expensive than ReLU
Use Cases:
- Deep neural networks
- When negative values need to be handled differently than in ReLU variants
Scaled Exponential Linear Unit (SELU)
Formula: SELU(z) = λ * ELU(z, α), with fixed constants λ ≈ 1.0507 and α ≈ 1.6733
Range: (-λα, ∞)
Properties:
- Self-normalizing: under suitable conditions, activations are pushed toward zero mean and unit variance across layers (see the sketch after this list)
- Helps networks converge faster without explicit normalization layers
- Requires specific weight initialization (LeCun normal) and standardized inputs
Use Cases:
- Deep feed-forward networks
- Self-normalizing neural networks
- When batch normalization isn’t feasible
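A minimal NumPy sketch of the self-normalizing behaviour, assuming the standard SELU constants, LeCun-normal weights, no biases, and standardized inputs; the mean and standard deviation of the activations stay close to 0 and 1 from layer to layer:

import numpy as np

LAMBDA, ALPHA = 1.0507009873554805, 1.6732632423543772  # standard SELU constants

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 256))  # standardized inputs

for layer in range(5):
    fan_in = x.shape[1]
    W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, 256))  # LeCun normal
    x = selu(x @ W)
    print(f"layer {layer}: mean={x.mean():+.3f}, std={x.std():.3f}")  # stays near 0 and 1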
Softmax
Formula: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
Range: (0, 1) for each output, with sum = 1
Properties:
- Converts logits to probability distribution
- Outputs sum to 1
- Emphasizes the largest values while suppressing lower ones
Use Cases:
- Output layer for multi-class classification
- When probability distribution over classes is needed
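Softmax is typically implemented in a numerically stable form by subtracting the maximum logit before exponentiating, which leaves the result unchanged; a short NumPy sketch:

import numpy as np

def softmax(logits):
    # Subtracting the max avoids overflow in exp() and does not change the output
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # approx [0.659 0.242 0.099]
print(p.sum())  # 1.0 -> a valid probability distribution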
Swish / SiLU (Sigmoid Linear Unit)
Formula: Swish(z) = z * sigmoid(z) = z * (1 / (1 + e^(-z)))
Range: approximately [-0.278, ∞)
Properties:
- Smooth, non-monotonic function
- Outperforms ReLU in many deep models
- Self-gating property
- Computationally more expensive than ReLU
Use Cases:
- Modern deep neural networks
- When slightly better performance than ReLU is needed
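A quick NumPy check of the range claim above: the minimum of z * sigmoid(z) is roughly -0.278, reached near z ≈ -1.28, and the function grows without bound for positive inputs.

import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))  # equivalent to z * sigmoid(z)

z = np.linspace(-10.0, 10.0, 200_001)
s = swish(z)
print(s.min(), z[s.argmin()])  # approx -0.2785 at z approx -1.278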
GELU (Gaussian Error Linear Unit)
Formula: GELU(z) = z * Φ(z) where Φ is the cumulative distribution function of the standard normal distribution
Range: approximately [-0.17, ∞)
Properties:
- Smooth, non-monotonic function
- Combines aspects of dropout and ReLU
- Used in state-of-the-art transformer models
Use Cases:
- Transformer models (BERT, GPT)
- Deep networks where performance is prioritized over computational efficiency
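GELU is evaluated either exactly through the Gaussian CDF, Φ(z) = 0.5(1 + erf(z/√2)), or with the widely used tanh approximation 0.5 * z * (1 + tanh(√(2/π)(z + 0.044715 z³))); the sketch below compares the two on a few points.

from math import erf, sqrt, pi, tanh

def gelu_exact(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf
    return z * 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gelu_tanh(z):
    # tanh approximation used in many transformer implementations
    return 0.5 * z * (1.0 + tanh(sqrt(2.0 / pi) * (z + 0.044715 * z ** 3)))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, gelu_exact(z), gelu_tanh(z))  # the two agree to about 1e-3 or better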
Comparison Table of Activation Functions
Activation Function | Range | Differentiable | Zero-Centered | Computation Cost | Vanishing Gradient | Dying Neurons |
---|---|---|---|---|---|---|
Sigmoid | (0, 1) | Yes | No | Medium | Yes | No |
Tanh | (-1, 1) | Yes | Yes | Medium | Yes | No |
ReLU | [0, ∞) | No (at z=0) | No | Low | No | Yes |
Leaky ReLU | (-∞, ∞) | No (at z=0) | No | Low | No | Reduced |
PReLU | (-∞, ∞) | No (at z=0) | No | Low | No | Reduced |
ELU | (-α, ∞) | Yes (α=1) | No | Medium | Reduced | No |
SELU | (-λα, ∞) | Yes | Approx. | Medium | No | No |
Swish | ≈[-0.278, ∞) | Yes | No | Medium | Reduced | No |
GELU | ≈[-0.17, ∞) | Yes | No | High | Reduced | No |
Softmax | (0, 1) | Yes | No | High | N/A | No |
When to Use Each Activation Function
Step-by-Step Selection Process
For hidden layers in standard feedforward networks:
- Start with ReLU (the usual default; a sketch for swapping hidden activations follows this list)
- If encountering dying neurons, try Leaky ReLU or ELU
- For very deep networks, consider SELU
- For state-of-the-art performance, try GELU or Swish
For output layers:
- Binary classification: Sigmoid
- Multi-class classification: Softmax
- Regression: Linear (no activation)
For recurrent neural networks:
- LSTM/GRU cells: Sigmoid for gates, tanh for states
- General RNN hidden states: tanh or ReLU variants
For self-normalizing networks:
- Use SELU with LeCun normal initialization
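One cheap way to support this kind of experimentation is to treat the hidden activation as a constructor argument; a minimal PyTorch sketch (the function name and layer sizes are illustrative, not from any particular library):

import torch.nn as nn

def make_mlp(in_dim, hidden_dim, out_dim, activation=nn.ReLU):
    """Tiny feed-forward network whose hidden activation is swappable."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        activation(),
        nn.Linear(hidden_dim, hidden_dim),
        activation(),
        nn.Linear(hidden_dim, out_dim),  # no activation: raw logits or regression output
    )

relu_net = make_mlp(32, 64, 10)                       # default hidden activation: ReLU
gelu_net = make_mlp(32, 64, 10, activation=nn.GELU)   # swap in GELU with one argument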
Common Challenges and Solutions
Challenge | Affected Functions | Solutions |
---|---|---|
Vanishing Gradient | Sigmoid, Tanh | Use ReLU, ELU, or SELU instead |
Exploding Gradient | Not tied to a specific activation (largely an initialization/architecture issue) | Use gradient clipping (see the sketch after this table), batch normalization |
Dying ReLU | ReLU | Use Leaky ReLU, PReLU, or ELU |
Computational Efficiency | GELU, Swish | Use ReLU for deployment if speed is critical |
Non-zero Mean Activations | ReLU, Sigmoid | Use tanh or batch normalization |
Non-differentiable Points | ReLU, Leaky ReLU | Use smoothed variants if this causes issues |
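For the exploding-gradient row, gradient clipping is a one-line addition to a PyTorch training step; a self-contained toy example (the tiny model and random data are placeholders for illustration):

import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale all gradients so their global L2 norm is at most max_norm before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()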
Best Practices and Practical Tips
- Start Simple: Begin with ReLU for hidden layers and appropriate output activations
- Experiment Systematically: If performance is unsatisfactory, try other activation functions
- Combine with Normalization: Pair activation functions with batch normalization for better training
- Consider Computational Budget: More complex activation functions add computational overhead
- Watch for Warning Signs:
- If loss doesn’t decrease, you might have dying neurons (try Leaky ReLU)
- If gradients explode, consider gradient clipping
- If training is unstable, try more stable functions like ELU or SELU
- Architecture-Specific Choices:
- For CNN: ReLU, Leaky ReLU, or Swish
- For RNN: tanh, ReLU variants
- For Transformers: GELU
- Layer-Specific Choices: Consider using different activation functions for different layers
- Initialize Properly: Some activations work best with specific weight initialization methods (e.g., He/Kaiming initialization for ReLU-family activations, LeCun normal for SELU; see the sketch below)
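A short PyTorch sketch of those pairings; note that PyTorch has no dedicated LeCun-normal initializer, so it is obtained here via kaiming_normal_ with gain 1 (nonlinearity='linear'), which yields the same std = sqrt(1/fan_in) distribution.

import torch.nn as nn

relu_layer = nn.Linear(128, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')  # He init for ReLU

selu_layer = nn.Linear(128, 128)
# LeCun normal: std = sqrt(1 / fan_in); kaiming_normal_ with gain 1 gives the same distribution
nn.init.kaiming_normal_(selu_layer.weight, mode='fan_in', nonlinearity='linear')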
Implementation Examples
PyTorch
import torch
import torch.nn as nn
# Common activation functions
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
prelu = nn.PReLU()
elu = nn.ELU(alpha=1.0)
selu = nn.SELU()
softmax = nn.Softmax(dim=1)
gelu = nn.GELU()
# Swish/SiLU implementation (recent PyTorch versions also provide this built in as nn.SiLU)
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)
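Continuing the snippet above, these modules compose like any other layer; a quick usage check with arbitrary sizes:

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),          # activation modules slot in like any other layer
    nn.Linear(32, 3),
    nn.Softmax(dim=1),  # dim=1 normalizes over the class dimension
)
print(model(torch.randn(4, 16)).sum(dim=1))  # each row sums to 1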
TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.layers import Activation
from tensorflow.keras import backend as K
# Common activation functions
sigmoid = Activation('sigmoid')
tanh = Activation('tanh')
relu = Activation('relu')
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.01)  # LeakyReLU is a layer, not a string-named activation
elu = Activation('elu')
selu = Activation('selu')
softmax = Activation('softmax')
gelu = Activation(tf.keras.activations.gelu)
# Swish/SiLU implementation (TF >= 2.2 also ships this as tf.keras.activations.swish)
def swish(x):
    return x * K.sigmoid(x)

tf.keras.utils.get_custom_objects().update({'swish': swish})
swish_activation = Activation('swish')
Resources for Further Learning
Books
- “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- “Neural Networks and Deep Learning” by Michael Nielsen
Research Papers
- “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” (He et al., 2015) – Introduces PReLU
- “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)” (Clevert et al., 2015)
- “Self-Normalizing Neural Networks” (Klambauer et al., 2017) – Introduces SELU
- “Searching for Activation Functions” (Ramachandran et al., 2017) – Introduces Swish
- “Gaussian Error Linear Units (GELUs)” (Hendrycks & Gimpel, 2016)
Online Resources
- CS231n Stanford Course Notes on Activation Functions
- Towards Data Science: Activation Functions Explained
- TensorFlow Documentation on Activation Functions
- PyTorch Documentation on Activation Functions