Introduction: What Are Activation Functions and Why They Matter
Activation functions are mathematical operations applied to each neuron's weighted sum of inputs to produce the neuron's output, determining whether, and how strongly, the neuron activates. They introduce non-linearity into neural networks, enabling them to learn complex patterns and relationships in data that simple linear models cannot capture. Without activation functions, a network of any depth would collapse into a single linear (affine) transformation of its inputs, as the short sketch below illustrates.
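A minimal NumPy sketch of that last point, using arbitrary toy dimensions: two stacked linear layers with no activation between them compute exactly the same function as a single linear layer.

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same computation collapses into a single linear layer y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: the extra layer adds no expressive power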
Core Concepts and Principles
- Neuron Activation: Transforms the weighted sum of inputs (z) into an output value (a)
- Non-linearity: Enables neural networks to model complex, non-linear relationships
- Differentiability: Most activation functions need to be differentiable for backpropagation
- Vanishing/Exploding Gradient Problem: Some activation functions can lead to gradients becoming too small or too large during training
- Sparsity: Certain activation functions promote sparse activations (many neurons output zero)
- Computational Efficiency: Some functions are more computationally expensive than others
Common Activation Functions
Sigmoid Function
Formula: σ(z) = 1 / (1 + e^(-z))
Range: (0, 1)
Properties:
- Smooth and differentiable everywhere
- Historically popular but rarely used as hidden layer activation in modern networks
- Suffers from the vanishing gradient problem for inputs far from zero, where the curve saturates (see the numeric check after this list)
- Outputs are not zero-centered
Use Cases:
- Output layer for binary classification
- Gates in LSTM and GRU units
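A small numeric check (NumPy, illustrative values only) makes the saturation concrete: the derivative σ(z)(1 − σ(z)) peaks at 0.25 and is nearly zero far from the origin, which is what starves gradients in deep stacks of sigmoid layers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(z)
grad = s * (1.0 - s)  # derivative of the sigmoid

print(s)     # approx [0.00005, 0.119, 0.5, 0.881, 0.99995]
print(grad)  # approx [0.00005, 0.105, 0.25, 0.105, 0.00005] -> vanishes far from 0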
Hyperbolic Tangent (tanh)
Formula: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Range: (-1, 1)
Properties:
- Zero-centered outputs
- Steeper gradients than sigmoid
- Still suffers from vanishing gradient problem
- Squashes inputs into the range (-1, 1)
Use Cases:
- Hidden layers in shallow networks
- Recurrent neural networks
- When zero-centered activations are preferred
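For intuition, tanh is simply a shifted and rescaled sigmoid, tanh(z) = 2σ(2z) − 1, which is why it shares the sigmoid's saturation behaviour while being zero-centered; a quick NumPy check of the identity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True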
Rectified Linear Unit (ReLU)
Formula: ReLU(z) = max(0, z)
Range: [0, ∞)
Properties:
- Computationally efficient
- Helps mitigate vanishing gradient problem
- Non-differentiable at z=0
- Can suffer from the “dying ReLU” problem, where neurons get stuck outputting 0 for every input (see the sketch after this list)
Use Cases:
- Default choice for hidden layers in CNNs
- Most feed-forward neural networks
- Deep networks
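The sketch below (NumPy, toy values, using the common convention of gradient 0 at z = 0) shows ReLU and its gradient side by side; the gradient is identically zero for negative inputs, which is the mechanism behind dying ReLU units.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention: treat the derivative at z = 0 as 0
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.] -> no gradient flows for negative inputs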
Leaky ReLU
Formula: Leaky_ReLU(z) = max(αz, z) where α is a small constant (e.g., 0.01)
Range: (-∞, ∞)
Properties:
- Allows small negative values when input is less than zero
- Helps prevent dying ReLU problem
- Preserves some gradient for negative inputs
Use Cases:
- Alternative to ReLU when neuron death is a concern
- Deep networks with potential for sparse gradients
Parametric ReLU (PReLU)
Formula: PReLU(z) = max(αz, z) where α is a learnable parameter
Range: (-∞, ∞)
Properties:
- Learnable slope for negative inputs
- Can adapt during training
- May overfit on small datasets
Use Cases:
- When you want the network to learn the optimal negative slope
- Deep networks with sufficient training data
Exponential Linear Unit (ELU)
Formula: ELU(z) = z if z > 0, α(e^z - 1) if z ≤ 0
Range: (-α, ∞)
Properties:
- Smooth, and differentiable everywhere when α = 1 (the usual default)
- Can produce negative outputs
- Helps mitigate vanishing gradient problem
- Computationally more expensive than ReLU
Use Cases:
- Deep neural networks
- When negative values need to be handled differently than in ReLU variants
Scaled Exponential Linear Unit (SELU)
Formula: SELU(z) = λ * ELU(z, α), with fixed constants λ ≈ 1.0507 and α ≈ 1.6733
Range: (-λα, ∞)
Properties:
- Self-normalizing: under suitable conditions, activations are pushed toward zero mean and unit variance across layers (see the sketch after this list)
- Helps networks converge faster without explicit normalization layers
- Requires specific weight initialization (LeCun normal) and standardized inputs
Use Cases:
- Deep feed-forward networks
- Self-normalizing neural networks
- When batch normalization isn’t feasible
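A minimal NumPy sketch of the self-normalizing behaviour, assuming the standard SELU constants, LeCun-normal weights, no biases, and standardized inputs; the mean and standard deviation of the activations stay close to 0 and 1 from layer to layer:

import numpy as np

LAMBDA, ALPHA = 1.0507009873554805, 1.6732632423543772  # standard SELU constants

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 256))  # standardized inputs

for layer in range(5):
    fan_in = x.shape[1]
    W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, 256))  # LeCun normal
    x = selu(x @ W)
    print(f"layer {layer}: mean={x.mean():+.3f}, std={x.std():.3f}")  # stays near 0 and 1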
Softmax
Formula: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
Range: (0, 1) for each output, with sum = 1
Properties:
- Converts logits to probability distribution
- Outputs sum to 1
- Emphasizes the largest values while suppressing lower ones
Use Cases:
- Output layer for multi-class classification
- When probability distribution over classes is needed
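Softmax is typically implemented in a numerically stable form by subtracting the maximum logit before exponentiating, which leaves the result unchanged; a short NumPy sketch:

import numpy as np

def softmax(logits):
    # Subtracting the max avoids overflow in exp() and does not change the output
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # approx [0.659 0.242 0.099]
print(p.sum())  # 1.0 -> a valid probability distribution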
Swish / SiLU (Sigmoid Linear Unit)
Formula: Swish(z) = z * sigmoid(z) = z * (1 / (1 + e^(-z)))
Range: approximately [-0.278, ∞)
Properties:
- Smooth, non-monotonic function
- Outperforms ReLU in many deep models
- Self-gating property
- Computationally more expensive than ReLU
Use Cases:
- Modern deep neural networks
- When slightly better performance than ReLU is needed
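A quick NumPy check of the range claim above: the minimum of z * sigmoid(z) is roughly -0.278, reached near z ≈ -1.28, and the function grows without bound for positive inputs.

import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))  # equivalent to z * sigmoid(z)

z = np.linspace(-10.0, 10.0, 200_001)
s = swish(z)
print(s.min(), z[s.argmin()])  # approx -0.2785 at z approx -1.278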
GELU (Gaussian Error Linear Unit)
Formula: GELU(z) = z * Φ(z) where Φ is the cumulative distribution function of the standard normal distribution
Range: approximately [-0.17, ∞)
Properties:
- Smooth, non-monotonic function
- Combines aspects of dropout and ReLU
- Used in state-of-the-art transformer models
Use Cases:
- Transformer models (BERT, GPT)
- Deep networks where performance is prioritized over computational efficiency
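GELU is evaluated either exactly through the Gaussian CDF, Φ(z) = 0.5(1 + erf(z/√2)), or with the widely used tanh approximation 0.5 * z * (1 + tanh(√(2/π)(z + 0.044715 z³))); the sketch below compares the two on a few points.

from math import erf, sqrt, pi, tanh

def gelu_exact(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf
    return z * 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gelu_tanh(z):
    # tanh approximation used in many transformer implementations
    return 0.5 * z * (1.0 + tanh(sqrt(2.0 / pi) * (z + 0.044715 * z ** 3)))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, gelu_exact(z), gelu_tanh(z))  # the two agree to about 1e-3 or better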
Comparison Table of Activation Functions
Activation Function | Range | Differentiable | Zero-Centered | Computation Cost | Vanishing Gradient | Dying Neurons |
---|---|---|---|---|---|---|
Sigmoid | (0, 1) | Yes | No | Medium | Yes | No |
Tanh | (-1, 1) | Yes | Yes | Medium | Yes | No |
ReLU | [0, ∞) | No (at z=0) | No | Low | No | Yes |
Leaky ReLU | (-∞, ∞) | No (at z=0) | No | Low | No | Reduced |
PReLU | (-∞, ∞) | No (at z=0) | No | Low | No | Reduced |
ELU | (-α, ∞) | Yes (α=1) | No | Medium | Reduced | No |
SELU | (-λα, ∞) | Yes | Approx. | Medium | No | No |
Swish | ≈[-0.278, ∞) | Yes | No | Medium | Reduced | No |
GELU | ≈[-0.17, ∞) | Yes | No | High | Reduced | No |
Softmax | (0, 1) | Yes | No | High | N/A | No |
When to Use Each Activation Function
Step-by-Step Selection Process
For hidden layers in standard feedforward networks:
- Start with ReLU (the usual default; a sketch for swapping hidden activations follows this list)
- If encountering dying neurons, try Leaky ReLU or ELU
- For very deep networks, consider SELU
- For state-of-the-art performance, try GELU or Swish
For output layers:
- Binary classification: Sigmoid
- Multi-class classification: Softmax
- Regression: Linear (no activation)
For recurrent neural networks:
- LSTM/GRU cells: Sigmoid for gates, tanh for states
- General RNN hidden states: tanh or ReLU variants
For self-normalizing networks:
- Use SELU with LeCun normal initialization
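One cheap way to support this kind of experimentation is to treat the hidden activation as a constructor argument; a minimal PyTorch sketch (the function name and layer sizes are illustrative, not from any particular library):

import torch.nn as nn

def make_mlp(in_dim, hidden_dim, out_dim, activation=nn.ReLU):
    """Tiny feed-forward network whose hidden activation is swappable."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        activation(),
        nn.Linear(hidden_dim, hidden_dim),
        activation(),
        nn.Linear(hidden_dim, out_dim),  # no activation: raw logits or regression output
    )

relu_net = make_mlp(32, 64, 10)                       # default hidden activation: ReLU
gelu_net = make_mlp(32, 64, 10, activation=nn.GELU)   # swap in GELU with one argument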
Common Challenges and Solutions
Challenge | Affected Functions | Solutions |
---|---|---|
Vanishing Gradient | Sigmoid, Tanh | Use ReLU, ELU, or SELU instead |
Exploding Gradient | Not tied to a specific activation (largely an initialization/architecture issue) | Use gradient clipping (see the sketch after this table), batch normalization |
Dying ReLU | ReLU | Use Leaky ReLU, PReLU, or ELU |
Computational Efficiency | GELU, Swish | Use ReLU for deployment if speed is critical |
Non-zero Mean Activations | ReLU, Sigmoid | Use tanh or batch normalization |
Non-differentiable Points | ReLU, Leaky ReLU | Use smoothed variants if this causes issues |
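For the exploding-gradient row, gradient clipping is a one-line addition to a PyTorch training step; a self-contained toy example (the tiny model and random data are placeholders for illustration):

import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale all gradients so their global L2 norm is at most max_norm before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()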
Best Practices and Practical Tips
- Start Simple: Begin with ReLU for hidden layers and appropriate output activations
- Experiment Systematically: If performance is unsatisfactory, try other activation functions
- Combine with Normalization: Pair activation functions with batch normalization for better training
- Consider Computational Budget: More complex activation functions add computational overhead
- Watch for Warning Signs:
- If loss doesn’t decrease, you might have dying neurons (try Leaky ReLU)
- If gradients explode, consider gradient clipping
- If training is unstable, try more stable functions like ELU or SELU
- Architecture-Specific Choices:
- For CNN: ReLU, Leaky ReLU, or Swish
- For RNN: tanh, ReLU variants
- For Transformers: GELU
- Layer-Specific Choices: Consider using different activation functions for different layers
- Initialize Properly: Some activations work best with specific weight initialization methods (e.g., He/Kaiming initialization for ReLU-family activations, LeCun normal for SELU; see the sketch below)
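A short PyTorch sketch of those pairings; note that PyTorch has no dedicated LeCun-normal initializer, so it is obtained here via kaiming_normal_ with gain 1 (nonlinearity='linear'), which yields the same std = sqrt(1/fan_in) distribution.

import torch.nn as nn

relu_layer = nn.Linear(128, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')  # He init for ReLU

selu_layer = nn.Linear(128, 128)
# LeCun normal: std = sqrt(1 / fan_in); kaiming_normal_ with gain 1 gives the same distribution
nn.init.kaiming_normal_(selu_layer.weight, mode='fan_in', nonlinearity='linear')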
Implementation Examples
PyTorch
import torch
import torch.nn as nn
# Common activation functions
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
prelu = nn.PReLU()
elu = nn.ELU(alpha=1.0)
selu = nn.SELU()
softmax = nn.Softmax(dim=1)
gelu = nn.GELU()
# Swish/SiLU implementation (recent PyTorch versions also provide this built in as nn.SiLU)
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)
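Continuing the snippet above, these modules compose like any other layer; a quick usage check with arbitrary sizes:

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),          # activation modules slot in like any other layer
    nn.Linear(32, 3),
    nn.Softmax(dim=1),  # dim=1 normalizes over the class dimension
)
print(model(torch.randn(4, 16)).sum(dim=1))  # each row sums to 1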
TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.layers import Activation
from tensorflow.keras import backend as K
# Common activation functions
sigmoid = Activation('sigmoid')
tanh = Activation('tanh')
relu = Activation('relu')
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.01)  # LeakyReLU is a layer, not a string-named activation
elu = Activation('elu')
selu = Activation('selu')
softmax = Activation('softmax')
gelu = Activation(tf.keras.activations.gelu)
# Swish/SiLU implementation (TF >= 2.2 also ships this as tf.keras.activations.swish)
def swish(x):
    return x * K.sigmoid(x)

tf.keras.utils.get_custom_objects().update({'swish': swish})
swish_activation = Activation('swish')
Resources for Further Learning
Books
- “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- “Neural Networks and Deep Learning” by Michael Nielsen
Research Papers
- “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” (He et al., 2015) – Introduces PReLU
- “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)” (Clevert et al., 2015)
- “Self-Normalizing Neural Networks” (Klambauer et al., 2017) – Introduces SELU
- “Searching for Activation Functions” (Ramachandran et al., 2017) – Introduces Swish
- “Gaussian Error Linear Units (GELUs)” (Hendrycks & Gimpel, 2016)
Online Resources
- CS231n Stanford Course Notes on Activation Functions
- Towards Data Science: Activation Functions Explained
- TensorFlow Documentation on Activation Functions
- PyTorch Documentation on Activation Functions