Deep Learning activation functions examined below:

  1. ReLU
  2. Leaky ReLU
  3. sigmoid
  4. tanh

Activation plotting preliminaries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

#Create array of possible z values
z = np.linspace(-5,5,num=1000)

def draw_activation_plot(a, quadrants=2, y_ticks=[0], two_quad_y_lim=[0, 5], four_quad_y_lim=[-1, 1]):
    """Draws plot of activation function.

    a: Output of activation function over domain z.
    quadrants: The number of quadrants in the plot (options: 2 or 4).
    y_ticks: Ticks to show on the y-axis.
    two_quad_y_lim: The limit of the y axis for 2 quadrant plots.
    four_quad_y_lim: The limit of the y axis for 4 quadrant plots.
    """
    #Create figure and axis
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    #Move left axis to the center
    ax.spines['left'].set_position('center')
    #Remove top and right axes
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    #Set x and y labels
    ax.set_xlabel('z')
    ax.set_ylabel('a')
    #Set ticks
    ax.set_yticks(y_ticks)
    #Set ylim
    ax.set_ylim(two_quad_y_lim)

    #4 Quadrant conditions
    if quadrants==4:
        #Move up bottom axis
        ax.spines['bottom'].set_position('center')
        #Move x and y labels for readability
        ax.xaxis.set_label_coords(1.02, 0.5)
        ax.yaxis.set_label_coords(0.5, 1.02)
        #Set y_lim for 4 quadrant graphs
        ax.set_ylim(four_quad_y_lim)

    #Plot z vs. activation function
    ax.plot(z, a)

1. ReLU

A great default choice for hidden layers. It is frequently used in industry and is almost always adequate to solve a problem.

Although ReLU is not differentiable at z=0, this is rarely a problem in practice, since an input of exactly 0 is rare. The derivative at z=0 can simply be set to 0 or 1 without issue.

In [2]:
relu = np.maximum(z, 0)

draw_activation_plot(relu)
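To make the derivative convention concrete, here is a small sketch (not part of the original notebook) that computes a subgradient for ReLU with `np.where`, arbitrarily assigning 0 as the derivative at z=0:

```python
import numpy as np

z = np.linspace(-5, 5, num=1000)
relu = np.maximum(z, 0)

# Subgradient of ReLU: 1 where z > 0, else 0.
# Assigning 0 at exactly z == 0 is an arbitrary but common convention;
# assigning 1 there would work just as well in practice.
relu_grad = np.where(z > 0, 1.0, 0.0)
```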

2. Leaky ReLU

Can perform better than ReLU, but it is used less often in practice.

Its small slope for negative z keeps the gradient from vanishing entirely there, which helps avoid "dead" units that stop learning. Note that, like ReLU, it is still not differentiable at exactly z=0, so the same convention of picking a derivative there applies.

In [3]:
leaky_ReLU = np.maximum(0.01*z, z)

draw_activation_plot(leaky_ReLU, quadrants=4)
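As a quick illustration (a sketch, not part of the original notebook), compare the two gradients on the negative half of the domain:

```python
import numpy as np

z = np.linspace(-5, 5, num=1000)

# ReLU's gradient is exactly 0 for negative z, so a unit stuck there
# receives no learning signal; leaky ReLU keeps a small slope of 0.01.
relu_grad = np.where(z > 0, 1.0, 0.0)
leaky_grad = np.where(z > 0, 1.0, 0.01)
```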

3. sigmoid

Almost never used except in the output layer when dealing with binary classification. Its most useful feature is that it guarantees an output between 0 and 1.

However, when z is very negative or very positive, the derivative of the sigmoid function is very small, which can slow down gradient descent.

In [4]:
sigmoid = 1/(1+np.exp(-z))

draw_activation_plot(sigmoid,y_ticks=[0,1], two_quad_y_lim=[0,1])
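This slowdown is easy to quantify: the sigmoid's derivative is sigmoid(z) * (1 - sigmoid(z)), which peaks at 0.25 at z = 0 and decays toward 0 as |z| grows. A small sketch, assuming the same z grid as above:

```python
import numpy as np

z = np.linspace(-5, 5, num=1000)
sigmoid = 1 / (1 + np.exp(-z))

# Derivative of the sigmoid: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
# It is at most 0.25 (at z = 0) and nearly 0 at the ends of the domain,
# which is what starves gradient descent of signal for large |z|.
sigmoid_grad = sigmoid * (1 - sigmoid)
```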

4. tanh

This is essentially a scaled and shifted version of the sigmoid function, and it is usually strictly better. The mean of the activations is closer to 0, which makes training on centered data easier. tanh is also a great default choice for hidden layers.

In [5]:
tanh = (np.exp(z)-np.exp(-z))/(np.exp(z)+np.exp(-z))

draw_activation_plot(tanh, quadrants=4, y_ticks=[-1, 0, 1])
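tanh's relationship to the sigmoid can be checked numerically: tanh(z) = 2 * sigmoid(2z) - 1, and the manual formula matches NumPy's built-in np.tanh. A quick sketch:

```python
import numpy as np

z = np.linspace(-5, 5, num=1000)
tanh = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# tanh is the sigmoid scaled and shifted: tanh(z) = 2 * sigmoid(2z) - 1.
sigmoid_2z = 1 / (1 + np.exp(-2 * z))
shifted_sigmoid = 2 * sigmoid_2z - 1
```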
