Ultimate Activation Functions Cheat Sheet: A Complete Guide for Deep Learning

Introduction: What Are Activation Functions and Why They Matter

Activation functions are mathematical operations applied to neural network nodes that determine whether, and how strongly, a neuron fires based on the weighted sum of its inputs. They introduce non-linearity into neural networks, enabling them to learn complex patterns and relationships in data that simple linear models cannot capture. Without activation functions, a neural network of any depth would collapse to a single linear transformation of its inputs.
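
To make that concrete, here is a minimal NumPy sketch (the weight matrices W1 and W2 are arbitrary examples, not from the text) showing that two stacked linear layers with no activation collapse into one linear layer:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))          # first "layer" weights
W2 = rng.normal(size=(2, 4))          # second "layer" weights
x = rng.normal(size=3)

deep = W2 @ (W1 @ x)                  # two linear layers, no activation
shallow = (W2 @ W1) @ x               # one equivalent linear layer
print(np.allclose(deep, shallow))     # True: the extra depth added no expressive power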

Core Concepts and Principles

  • Neuron Activation: Transforms the weighted sum of inputs (z) into an output value (a)
  • Non-linearity: Enables neural networks to model complex, non-linear relationships
  • Differentiability: Most activation functions need to be differentiable for backpropagation
  • Vanishing/Exploding Gradient Problem: Some activation functions can lead to gradients becoming too small or too large during training
  • Sparsity: Certain activation functions promote sparse activations (many neurons output zero)
  • Computational Efficiency: Some functions are more computationally expensive than others

Common Activation Functions

Sigmoid Function

Formula: σ(z) = 1 / (1 + e^(-z))
Range: (0, 1)
Properties:

  • Smooth and differentiable everywhere
  • Historically popular but rarely used as hidden layer activation in modern networks
  • Suffers from vanishing gradient problem for values far from zero
  • Outputs are not zero-centered

Use Cases:

  • Output layer for binary classification
  • Gates in LSTM and GRU units
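
A minimal NumPy sketch of the formula and its derivative, σ'(z) = σ(z)(1 − σ(z)), which shows why gradients vanish for inputs far from zero:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(z)
grad = s * (1 - s)          # derivative of the sigmoid
print(s)                    # ≈ [0.000045, 0.119, 0.5, 0.881, 0.999955]
print(grad)                 # ≈ [0.000045, 0.105, 0.25, 0.105, 0.000045] -> tiny at the tails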

Hyperbolic Tangent (tanh)

Formula: tanh(z) = (e^z – e^(-z)) / (e^z + e^(-z))
Range: (-1, 1)
Properties:

  • Zero-centered outputs
  • Steeper gradients than sigmoid
  • Still suffers from vanishing gradient problem
  • Squashes inputs into the range (-1, 1)

Use Cases:

  • Hidden layers in shallow networks
  • Recurrent neural networks
  • When zero-centered activations are preferred
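
A minimal NumPy sketch; the outputs are zero-centered, but the derivative 1 − tanh²(z) still shrinks toward zero for large |z|:

import numpy as np

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
t = np.tanh(z)
grad = 1.0 - t ** 2          # derivative of tanh
print(t)                     # ≈ [-0.995, -0.762, 0.0, 0.762, 0.995]
print(grad)                  # ≈ [0.0099, 0.42, 1.0, 0.42, 0.0099]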

Rectified Linear Unit (ReLU)

Formula: ReLU(z) = max(0, z)
Range: [0, ∞)
Properties:

  • Computationally efficient
  • Helps mitigate vanishing gradient problem
  • Non-differentiable at z=0
  • Can suffer from “dying ReLU” problem (neurons that always output 0)

Use Cases:

  • Default choice for hidden layers in CNNs
  • Most feed-forward neural networks
  • Deep networks
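
A minimal NumPy sketch; negative inputs are zeroed out, and because their gradient is exactly zero, a neuron stuck in that region stops learning (the dying-ReLU problem):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))                       # [0.  0.  0.  0.5 3. ]
print((z > 0).astype(float))         # gradient: 0 for negative inputs, 1 for positive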

Leaky ReLU

Formula: Leaky_ReLU(z) = max(αz, z) where α is a small constant (e.g., 0.01)
Range: (-∞, ∞)
Properties:

  • Allows small negative values when input is less than zero
  • Helps prevent dying ReLU problem
  • Preserves some gradient for negative inputs

Use Cases:

  • Alternative to ReLU when neuron death is a concern
  • Deep networks with potential for sparse gradients
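
A minimal NumPy sketch with α = 0.01, the value suggested above; negative inputs keep a small non-zero slope instead of being clipped to zero:

import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))                 # [-0.03  -0.005  0.     0.5    3.   ]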

Parametric ReLU (PReLU)

Formula: PReLU(z) = max(αz, z) where α is a learnable parameter
Range: (-∞, ∞)
Properties:

  • Learnable slope for negative inputs
  • Can adapt during training
  • May overfit on small datasets

Use Cases:

  • When you want the network to learn the optimal negative slope
  • Deep networks with sufficient training data
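
A short PyTorch sketch showing that the slope α is an ordinary learnable parameter (initialized to 0.25 by default in nn.PReLU) and therefore receives gradients during training:

import torch
import torch.nn as nn

prelu = nn.PReLU()                    # one learnable alpha shared across channels
print(prelu.weight)                   # Parameter containing tensor([0.2500])

out = prelu(torch.tensor([-2.0, 3.0]))
out.sum().backward()
print(prelu.weight.grad)              # tensor([-2.]) -> alpha is updated during training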

Exponential Linear Unit (ELU)

Formula: ELU(z) = z if z > 0, α(e^z – 1) if z ≤ 0
Range: (-α, ∞)
Properties:

  • Smooth function (differentiable everywhere)
  • Can produce negative outputs
  • Helps mitigate vanishing gradient problem
  • Computationally more expensive than ReLU

Use Cases:

  • Deep neural networks
  • When negative values need to be handled differently than in ReLU variants
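
A minimal NumPy sketch with α = 1.0; negative inputs saturate smoothly toward -α rather than being cut off at zero:

import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(z))                        # ≈ [-0.993 -0.632  0.     1.     5.   ]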

Scaled Exponential Linear Unit (SELU)

Formula: SELU(z) = λz if z > 0, λα(e^z - 1) if z ≤ 0, where λ ≈ 1.0507 and α ≈ 1.6733 are fixed constants
Range: (-λα, ∞) ≈ (-1.758, ∞)
Properties:

  • Self-normalizing properties
  • Pushes layer activations toward zero mean and unit variance across layers
  • Helps networks converge faster
  • Requires specific weight initialization (LeCun normal)

Use Cases:

  • Deep feed-forward networks
  • Self-normalizing neural networks
  • When batch normalization isn’t feasible
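
A minimal NumPy sketch using the fixed constants from the SELU paper (λ ≈ 1.0507, α ≈ 1.6733):

import numpy as np

LAMBDA, ALPHA = 1.0507009873554805, 1.6732632423543772   # constants from Klambauer et al. (2017)

def selu(z):
    return LAMBDA * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(selu(z))                       # ≈ [-1.67  -1.11   0.     1.05   3.15 ]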

Softmax

Formula: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
Range: (0, 1) for each output, with sum = 1
Properties:

  • Converts logits to probability distribution
  • Outputs sum to 1
  • Emphasizes the largest values while suppressing lower ones

Use Cases:

  • Output layer for multi-class classification
  • When probability distribution over classes is needed
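
A minimal NumPy sketch of the numerically stable form, which subtracts the largest logit before exponentiating; this leaves the result unchanged but prevents overflow:

import numpy as np

def softmax(z):
    shifted = z - np.max(z)          # stabilizes exp() without changing the result
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))               # ≈ [0.659 0.242 0.099]  (sums to 1)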

Swish / SiLU (Sigmoid Linear Unit)

Formula: Swish(z) = z * sigmoid(z) = z * (1 / (1 + e^(-z)))
Range: approximately (-0.278, ∞)
Properties:

  • Smooth, non-monotonic function
  • Outperforms ReLU in many deep models
  • Self-gating property
  • Computationally more expensive than ReLU

Use Cases:

  • Modern deep neural networks
  • When slightly better performance than ReLU is needed
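
A minimal NumPy sketch; the function dips slightly below zero, reaching its minimum of roughly -0.278 near z ≈ -1.28, which is where the quoted range comes from:

import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))    # z * sigmoid(z)

z = np.linspace(-5.0, 5.0, 2001)
print(swish(np.array([-1.278])))     # ≈ [-0.2784]  (approximate global minimum)
print(swish(z).min())                # ≈ -0.2784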

GELU (Gaussian Error Linear Unit)

Formula: GELU(z) = z * Φ(z) where Φ is the cumulative distribution function of the standard normal distribution
Range: approximately (-0.17, ∞)
Properties:

  • Smooth, non-monotonic function
  • Combines aspects of dropout and ReLU
  • Used in state-of-the-art transformer models

Use Cases:

  • Transformer models (BERT, GPT)
  • Deep networks where performance is prioritized over computational efficiency
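
A minimal NumPy/SciPy sketch of the exact form next to the tanh approximation used in many transformer codebases:

import numpy as np
from scipy.stats import norm          # standard normal CDF

def gelu_exact(z):
    return z * norm.cdf(z)

def gelu_tanh(z):                     # tanh-based approximation
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu_exact(z))                  # ≈ [-0.0455 -0.1543  0.      0.3457  1.9545]
print(gelu_tanh(z))                   # nearly identical values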

Comparison Table of Activation Functions

| Activation Function | Range | Differentiable | Zero-Centered | Computation Cost | Vanishing Gradient | Dying Neurons |
| --- | --- | --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | Yes | No | Medium | Yes | No |
| Tanh | (-1, 1) | Yes | Yes | Medium | Yes | No |
| ReLU | [0, ∞) | No (at z = 0) | No | Low | No | Yes |
| Leaky ReLU | (-∞, ∞) | No (at z = 0) | No | Low | No | Reduced |
| PReLU | (-∞, ∞) | No (at z = 0) | No | Low | No | Reduced |
| ELU | (-α, ∞) | Yes | No | Medium | Reduced | No |
| SELU | (-λα, ∞) | Yes | Approx. | Medium | No | No |
| Swish | (≈ -0.278, ∞) | Yes | No | Medium | Reduced | No |
| GELU | (≈ -0.17, ∞) | Yes | No | High | Reduced | No |
| Softmax | (0, 1) | Yes | No | High | N/A | No |

When to Use Each Activation Function

Step-by-Step Selection Process

  1. For hidden layers in standard feedforward networks:

    • Start with ReLU (default choice)
    • If encountering dying neurons, try Leaky ReLU or ELU
    • For very deep networks, consider SELU
    • For state-of-the-art performance, try GELU or Swish
  2. For output layers:

    • Binary classification: Sigmoid
    • Multi-class classification: Softmax
    • Regression: Linear (no activation)
  3. For recurrent neural networks:

    • LSTM/GRU cells: Sigmoid for gates, tanh for states
    • General RNN hidden states: tanh or ReLU variants
  4. For self-normalizing networks:

    • Use SELU with LeCun normal initialization
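
As a rough illustration of these defaults, here is a minimal PyTorch sketch of a hypothetical multi-class classifier: ReLU in the hidden layers and no explicit softmax on the output, because nn.CrossEntropyLoss applies log-softmax to the logits internally:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),                   # default hidden-layer choice
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 5),            # raw logits for 5 classes
)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally

x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))
loss = loss_fn(model(x), y)
loss.backward()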

Common Challenges and Solutions

| Challenge | Affected Functions | Solutions |
| --- | --- | --- |
| Vanishing Gradient | Sigmoid, Tanh | Use ReLU, ELU, or SELU instead |
| Exploding Gradient | Mostly a weight initialization issue | Use gradient clipping, batch normalization |
| Dying ReLU | ReLU | Use Leaky ReLU, PReLU, or ELU |
| Computational Efficiency | GELU, Swish | Use ReLU for deployment if speed is critical |
| Non-zero Mean Activations | ReLU, Sigmoid | Use tanh or batch normalization |
| Non-differentiable Points | ReLU, Leaky ReLU | Use smoothed variants if this causes issues |

Best Practices and Practical Tips

  • Start Simple: Begin with ReLU for hidden layers and appropriate output activations
  • Experiment Systematically: If performance is unsatisfactory, try other activation functions
  • Combine with Normalization: Pair activation functions with batch normalization for better training
  • Consider Computational Budget: More complex activation functions add computational overhead
  • Watch for Warning Signs:
    • If loss doesn’t decrease, you might have dying neurons (try Leaky ReLU)
    • If gradients explode, consider gradient clipping
    • If training is unstable, try more stable functions like ELU or SELU
  • Architecture-Specific Choices:
    • For CNN: ReLU, Leaky ReLU, or Swish
    • For RNN: tanh, ReLU variants
    • For Transformers: GELU
  • Layer-Specific Choices: Consider using different activation functions for different layers
  • Initialize Properly: Some activations work best with specific weight initialization methods
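
A short PyTorch sketch of the initialization pairings mentioned above (assuming plain linear layers): Kaiming/He initialization for ReLU layers and LeCun normal initialization for SELU layers:

import math
import torch.nn as nn

relu_layer = nn.Linear(128, 64)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')   # He init, pairs with ReLU

selu_layer = nn.Linear(128, 64)
# LeCun normal: std = 1 / sqrt(fan_in); PyTorch has no named helper, so set it directly
nn.init.normal_(selu_layer.weight, mean=0.0, std=1.0 / math.sqrt(selu_layer.in_features))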

Implementation Examples

PyTorch

import torch
import torch.nn as nn

# Common activation functions
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
prelu = nn.PReLU()
elu = nn.ELU(alpha=1.0)
selu = nn.SELU()
softmax = nn.Softmax(dim=1)
gelu = nn.GELU()

# Swish/SiLU implementation (PyTorch >= 1.7 also provides this built in as nn.SiLU)
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)
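
Continuing the snippet above, a quick usage check on a sample tensor:

x = torch.randn(4, 10)            # a batch of 4 examples with 10 features
print(relu(x).min())              # never below zero
print(softmax(x).sum(dim=1))      # each row sums to 1
print(Swish()(x).shape)           # same shape as the input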

TensorFlow/Keras

import tensorflow as tf
from tensorflow.keras.layers import Activation
from tensorflow.keras import backend as K

# Common activation functions
sigmoid = Activation('sigmoid')
tanh = Activation('tanh')
relu = Activation('relu')
leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.01)  # LeakyReLU is a layer, not a string-named activation
elu = Activation('elu')
selu = Activation('selu')
softmax = Activation('softmax')
gelu = Activation(tf.keras.activations.gelu)

# Swish/SiLU implementation (TensorFlow >= 2.2 also ships it as the built-in 'swish' activation)
def swish(x):
    return x * K.sigmoid(x)

swish_activation = Activation(swish)  # or simply Activation('swish') on recent TF versions

Resources for Further Learning

Books

  • “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  • “Neural Networks and Deep Learning” by Michael Nielsen

Research Papers

  • “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” (He et al., 2015) – Introduces PReLU
  • “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)” (Clevert et al., 2015)
  • “Self-Normalizing Neural Networks” (Klambauer et al., 2017) – Introduces SELU
  • “Searching for Activation Functions” (Ramachandran et al., 2017) – Introduces Swish
  • “Gaussian Error Linear Units (GELUs)” (Hendrycks & Gimpel, 2016)
