Activation Functions

An in-depth analysis of various activation functions used in deep learning

Dhairya Kumar
9 min read · May 31, 2020

If you have ever attended a lecture or read a book about deep learning, then you must have heard the term “activation functions”. Understanding how various activation functions work and why we use them will help you in building an intuition for solving deep learning problems.

In this article, I will focus mainly on non-linear activation functions and try to explain the issues and benefits associated with each of them. But first, let’s understand why we use non-linear activation functions instead of using linear activation functions.

Linear Activation Function

Let us consider a simple linear function, Y = cX, where c is a constant.
The two main issues with a linear activation function are:

1. A linear activation function turns a deep neural network into a shallow network

Let me explain what I mean by that. Consider the equation above, Y = cX. If we apply this function at every layer of our neural network, then after 'n' layers we end up with a composition of 'n' linear functions. But the issue is that this composition is still a single linear function of the input X.

If it is still not clear what the issue is, then let me explain in simple terms.

Suppose we have 3 layers in our neural network, each applying a linear function with constants c1, c2 and c3. The output of the 3rd layer is c3 * c2 * c1 * X, which is just another constant times the input X. So, essentially, we have only a single layer in our network.

It prevents us from solving complex deep learning problems which can't be solved by a simple linear decision boundary.
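
To make this concrete, here is a minimal NumPy sketch (the layer sizes and weight values are made up for illustration) showing that a stack of purely linear layers collapses into a single linear layer:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 1))          # a single 4-dimensional input

    # three "layers" with a linear (identity) activation
    W1 = rng.normal(size=(5, 4))
    W2 = rng.normal(size=(3, 5))
    W3 = rng.normal(size=(2, 3))

    deep_output = W3 @ (W2 @ (W1 @ x))   # pass x through all three layers

    # the same mapping collapses into one matrix, i.e. one layer
    W_single = W3 @ W2 @ W1
    shallow_output = W_single @ x

    print(np.allclose(deep_output, shallow_output))  # True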

2. A linear activation function has a constant gradient

If you are aware of the basics of deep learning, then you must have heard about the holy grail of deep learning, i.e. Gradient Descent. It is the strategy which allows us to optimise our loss function and train our model. For Gradient Descent to be useful, the derivative of the activation function has to depend on its input. As you can see from the graph, a linear activation has the same slope everywhere: its derivative is just the constant c, regardless of the input. The gradient therefore carries no information about the input, and no matter how many layers we stack, Gradient Descent can only ever fit a linear mapping.

Non-Linear Activation Functions

The main reason for using non-linear activation functions is that they allow us to model intricate mathematical relations, which in turn help us in solving complex deep learning problems.

There are various non-linear activation functions, I will explain some of them in detail here along with explaining the advantages and disadvantages of using them.

1. Sigmoid

Sigmoid Function

Advantages

  • The values are bounded between 0 and 1
  • It has a smooth gradient

Disadvantages

  • It is not zero-centred
  • It is computationally expensive, as it involves an exponential
  • It has a problem of vanishing gradients

Most of the articles explain activation functions like this. They just abruptly end the explanation and move on to the next function. I don’t want to do that here. I will try to answer the following questions in this article.

  • What exactly is the vanishing gradients problem?
  • When will the vanishing gradients problem occur?
  • Why is being zero-centred desirable for an activation function?

Hopefully, it will help you to better understand the concepts associated with activation functions.

Vanishing Gradients Problem

To understand the vanishing gradients problem, we need to look at two things. The first one is the graph of the sigmoid function and the other thing is the update rule in the Gradient Descent algorithm.

If you are not familiar with this update rule, here it is briefly: W' = W - η * ▽L. We calculate the gradient of the loss function with respect to the weight and then subtract the product of this gradient and the learning rate η from the original weight to get the new, updated weight value.
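
As a minimal sketch, a single update step looks like this (the weight, learning rate and gradient values below are arbitrary placeholders):

    learning_rate = 0.1     # η
    w = 2.5                 # current weight
    grad_loss = 0.8         # ▽L, normally computed via backpropagation

    w_new = w - learning_rate * grad_loss   # W' = W - η * ▽L
    print(w_new)            # 2.42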

Now, shift your focus to the graph of the sigmoid function. We can see that the gradient will be (almost) zero when the value of sigmoid is at or near 0 or 1. We don't need any mathematical theorems to prove this. Just by looking at the graph, we can see that the sigmoid curve is nearly flat, i.e. almost a horizontal line, when its value is close to 0 or 1, and the slope of a horizontal line is 0.

Also, the derivative of the sigmoid activation function is S'(z) = S(z) * (1-S(z))

So, when S(z) is either 0 or 1, the derivative will be 0.
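
You can check this numerically; the short sketch below evaluates S(z) and S'(z) for a few values of z:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)          # S'(z) = S(z) * (1 - S(z))

    for z in [0.0, 2.0, 5.0, 10.0]:
        print(z, sigmoid(z), sigmoid_grad(z))
    # z = 0  -> S(z) = 0.5,     S'(z) = 0.25 (its maximum)
    # z = 10 -> S(z) ≈ 0.99995, S'(z) ≈ 0.000045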

With this knowledge, let's apply the update rule. If the gradient of the activation function is 0, then by the chain rule the gradient of the loss is also 0, and our update equation becomes

W' = W - η * ▽L = W, since η * ▽L = 0

So, our training process will come to a halt and we won’t be able to further train our model.

That is the first part of the problem; now let us understand when the value of sigmoid will be close to 0 or 1.

We apply the sigmoid activation function to the weighted sum of the inputs (Σ WiXi). The sigmoid saturates towards 0 or 1 when this weighted sum is large in magnitude. So, we only have two suspects here, the inputs and the weights, and one of them caused the mayhem that we are now trying to investigate. Let's first focus on suspect one, i.e. X. Just think for a moment: can X be the culprit? If you think X is innocent, then you are right. X is indeed innocent. The inputs can't be large, since we normalise them so that they typically lie between 0 and 1.

Looks like we found our culprit: large weight values cause the vanishing gradients problem in sigmoid. So, we should avoid initialising our weights with large values in order to mitigate this problem.
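
Here is a quick sketch of the effect, assuming inputs normalised to [0, 1] (the weight values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.2, 0.7, 0.5, 0.9])            # normalised inputs in [0, 1]
    w_small = np.array([0.3, -0.1, 0.2, 0.4])
    w_large = np.array([30.0, 15.0, 20.0, 25.0])

    for w in (w_small, w_large):
        z = w @ x                                 # weighted sum Σ WiXi
        s = sigmoid(z)
        print(z, s, s * (1 - s))
    # small weights: z = 0.45, S(z) ≈ 0.61, S'(z) ≈ 0.24
    # large weights: z = 49.0, S(z) ≈ 1.0,  S'(z) ≈ 0 (gradient vanishes)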

Zero-Centred Functions

Let us consider the following neural network.

Neural Network
  • h01 and h02 are the inputs to the network
  • h11, h12, h21, h22 and h3 are the activation values, i.e. the values after the activation function is applied
  • a11, a12, a21, a22 and a3 are the pre-activation values, i.e. the values before the activation function is applied
  • w1 and w2 are the weights for the last layer
  • Y is the output

Let us calculate the derivative of the loss function with respect to w1 and w2.

▽w1 = ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 * ∂a3/∂w1 - (i)
▽w2 = ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 * ∂a3/∂w2 - (ii)
a3 = w1 * h21 + w2 * h22 - (iii)

So, from equation (iii) above, we can say that

∂a3/∂w1 = h21 - (iv)
∂a3/∂w2 = h22 - (v)

Substituting the values of (iv) and (v) in (i) and (ii), we get

▽w1 = ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 * h21
▽w2 = ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 * h22

Let's focus on the term ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3.

This term is common to both gradients. We know that h21 and h22 are the outputs of a sigmoid function, so they lie between 0 and 1; in other words, they are always positive. The sign of each gradient therefore depends entirely on this common term: if it is positive, both gradients are positive; if it is negative, both are negative. The same argument holds for every layer whose inputs come from a sigmoid, so within each such layer the gradients of the weights will either be all positive or all negative.
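
A small numeric sketch of this, where 'common' stands for the shared term ∂L/∂y * ∂y/∂h3 * ∂h3/∂a3 (its value here is made up):

    # h21 and h22 are sigmoid outputs, so they are always positive
    h21, h22 = 0.8, 0.3

    for common in (1.5, -1.5):       # try a positive and a negative common term
        grad_w1 = common * h21
        grad_w2 = common * h22
        print(grad_w1, grad_w2)      # both positive or both negative, never mixed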

At first glance it may not be obvious how this affects training, so let's focus on that. It limits the directions in which we can move: every update pushes all the weights in the same direction. Consider this classic quadrant example.

Quadrant

In this case, we can't move directly towards the optimal W if reaching it requires increasing one weight while decreasing another; we have to zig-zag towards it along a longer path, and hence our model takes more time to converge.

So, hopefully, you now understand why we prefer zero-centred activation functions.

Tanh

Tanh Function

Advantages

  • It squishes the values between -1 and 1
  • It is a zero-centred function

Disadvantages

  • It also has the vanishing gradients problem
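
A small sketch contrasting tanh and sigmoid on inputs that are symmetric around zero (tanh is related to sigmoid by tanh(z) = 2 * S(2z) - 1):

    import numpy as np

    z = np.linspace(-3, 3, 7)                 # inputs symmetric around 0

    sigmoid_out = 1.0 / (1.0 + np.exp(-z))
    tanh_out = np.tanh(z)

    print(sigmoid_out.mean())                 # 0.5: sigmoid outputs are never negative
    print(tanh_out.mean())                    # 0.0: tanh outputs are centred around zero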

Softmax

Softmax Function
Softmax Formula
  • It squishes the values between 0 and 1
  • The sum of all the values is equal to 1 i.e. it provides a nice probabilistic interpretation
  • It is typically used for output layers
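
The formula is softmax(zi) = e^zi / Σj e^zj. Below is a minimal sketch of it and the properties listed above (the logit values are arbitrary):

    import numpy as np

    def softmax(z):
        z = z - np.max(z)          # subtract the max for numerical stability
        exps = np.exp(z)
        return exps / exps.sum()

    logits = np.array([2.0, 1.0, 0.1])
    probs = softmax(logits)

    print(probs)                   # ≈ [0.659 0.242 0.099], each value in (0, 1)
    print(probs.sum())             # 1.0, i.e. a valid probability distribution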

ReLU

ReLU Function

Advantages

  • It solves the vanishing gradients problem
  • It is computationally less expensive
  • It converges faster

Disadvantages

  • It should only be used within the hidden layers of the network
  • It suffers from dying ReLU problem
  • The range of ReLU is [0,inf), so it can effectively blow up the activation
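
For reference, ReLU itself is just max(0, z); here is a minimal sketch of the points listed above (the input values are arbitrary):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)     # cheap: no exponentials to evaluate

    z = np.array([-5.0, -0.5, 0.0, 0.5, 500.0])
    print(relu(z))                    # [0. 0. 0. 0.5 500.]
    # negative inputs are zeroed out, while positive inputs pass through
    # unchanged and unbounded, which is how activations can blow up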

As I said earlier, this article won't be like other articles where you are simply told the advantages and disadvantages of various activation functions and have to take them on faith. Instead, I will explain in detail what the issues associated with ReLU are and how we can overcome them.

Dying ReLU Problem

Let's start by analysing the ReLU function. It produces sparse activations; what I mean by that is that it does not pass on the output of every neuron. As you can see from the graph, its value is 0 whenever the input is negative. To understand this better, let's first look at what sparsity means.

Sparsity here simply refers to the fact that ReLU keeps the output of only some of the neurons, essentially the ones whose pre-activation is positive; the rest output exactly zero. We can understand it better by relating it to biological neurons: there are billions of neurons in our brain, each with a specific role, and they fire only when they receive the required signals.

Suppose you are creating a classification model with three classes: Human, Dog, Building. If there is a neuron which is responsible only for detecting eyes, does it make sense for it to be even partially activated when we are classifying the image of a building? This is what sparsity brings to the table: the neurons that are not required are completely switched off, i.e. set to zero.

Hopefully the concept of sparsity is clear now. We have looked at its advantages, so let's focus on its downside. Sparsity is the reason ReLU suffers from the dying neuron, or dying ReLU, problem. If a neuron's pre-activation 'z' is negative for most inputs, its output is 0, and since the derivative of ReLU is 0 for negative inputs, its gradient is 0 as well. Such a neuron stops receiving weight updates, and if many neurons end up in this state, part of the network saturates and won't be able to learn further.
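
To make this concrete, here is a minimal sketch of a 'dead' ReLU unit, assuming inputs normalised to [0, 1] and a made-up, strongly negative bias:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(1000, 4))   # normalised inputs in [0, 1]
    w = np.array([0.5, -0.2, 0.3, 0.1])
    b = -10.0                                    # large negative bias

    z = X @ w + b                                # pre-activation for every sample
    print((z < 0).mean())                        # 1.0: z is negative for every input
    print(relu(z).max())                         # 0.0: the unit never fires
    # ReLU's gradient is 0 wherever z < 0, so this unit receives no updates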

The main reasons for getting a large number of negative 'z' values are setting the learning rate too high or using a large negative bias. So, setting a smaller learning rate can mitigate the problem, and if the issue persists, we have other alternatives like Leaky ReLU, Parametric ReLU (PReLU), etc.

I will explain them briefly.

Leaky ReLU and PReLU

Leaky and Parametric ReLU

Leaky ReLU has a small slope on the negative side instead of being exactly zero, which helps mitigate the dying ReLU problem.

PReLU is a generalisation of Leaky ReLU and fights the dying ReLU problem in the same way. The only difference is the term 'a': instead of using a fixed slope like 0.01, PReLU treats 'a' as a parameter that the network learns during training.
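
A small sketch of both variants; 0.01 is the usual fixed slope for Leaky ReLU, while the PReLU slope 'a' would normally be learned (here it is just a hand-set value for illustration):

    import numpy as np

    def leaky_relu(z, slope=0.01):
        # fixed small slope on the negative side
        return np.where(z > 0, z, slope * z)

    def prelu(z, a):
        # same form, but 'a' is a parameter the network learns
        return np.where(z > 0, z, a * z)

    z = np.array([-3.0, -1.0, 0.0, 2.0])
    print(leaky_relu(z))          # [-0.03 -0.01  0.    2.  ]
    print(prelu(z, a=0.2))        # [-0.6  -0.2   0.    2.  ]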

Advantages

  • It mitigates the dying ReLU problem
  • It speeds up the training process

And with that, we have come to the end of this article. Thanks a ton for reading it.

My LinkedIn, Twitter and Github
You can check out my website to know more about me and my work.

Originally published at https://alpha-dev.in on May 31, 2020.
