
# 1 Introduction

Let's say you have a data set with six houses: you know the size of each house in square feet or square meters and its price, and you want to fit a function to predict the price from the size. Naturally you think of curve fitting, and more technically of linear regression. So let's fit a straight line to these data, and we get a straight line. (Source of the material: Coursera's Deep Learning course.)

But hang on: prices can never be negative, so instead of the straight-line fit, which eventually goes negative, we want something that never becomes negative. So the function for predicting the price of the house is zero up to a point, and then a straight increasing line as the size increases.

You can think of this function that you've just fit to the housing prices as a very simple neural network, almost the simplest possible one. And by the way, you see this function a lot in the neural network literature: a function which is zero for a while and then takes off as a straight line. It is called the ReLU function, which stands for rectified linear unit (R-e-L-U), and "rectify" just means taking a max with 0, which is why you get a function of this shape.
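The shape described above can be sketched in a couple of lines (Python/NumPy is used here and in the later sketches purely for illustration; the course itself does not prescribe a language):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: element-wise max(0, z)."""
    return np.maximum(0.0, z)

# Negative inputs are "rectified" to zero; positive inputs pass through.
print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
```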

## 1.1 Adding More Complexity Layer by Layer

Suppose for the same problem that instead of predicting the price of a house just from its size, you now have other features. You know other things about the house, such as the number of bedrooms, and you might think that one of the things that really affects the price of a house is family size, which in turn is determined by the size of the house in square meters as well as the number of bedrooms.

There is also the location (zip code or pin code), which determines factors like the quality of the area, distance to major centers and walkability. This is in turn also influenced by the kind of schooling present in the area. All of this leads us to a structure like the one above.

## 1.2 Implementation Structure

So what you actually implement (formally) is the structure above: a neural network with four inputs. The input features might be the size, the number of bedrooms, the zip code or postal code, and the wealth of the neighborhood. Given these input features, the job of the neural network is to predict the price y. Notice also that each of these circles, called hidden units in the neural network, takes as its input all four input features. So rather than saying the first node represents family size and that family size depends only on the features x1 and x2, we let the network decide for itself what each hidden unit computes from all of the inputs.

So now we have an input layer with x1, x2, x3, x4, and the layer in the middle (the hidden layer) of the neural network is densely connected, because every input feature is connected to every one of the circles in the middle. And the remarkable thing about neural networks is that, given enough training examples with both x and y, they are remarkably good at figuring out functions that accurately map from x to y.

# 2 Logistic Regression

Logistic regression is a learning algorithm that you use when the output labels 'Y' in a supervised learning problem are all either zero or one, i.e. for binary classification problems. Given an input feature vector 'X', maybe corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, which we'll call $\hat Y$, your estimate of 'Y'. More formally, you want $\hat Y$ to be the probability that 'Y' is equal to one given the input features 'X'.

Now there is going to be some multiplier on the given 'X' which decides the influence of 'X' on $\hat Y$, and there needs to be a bias (cutoff) point after which the multiplier really kicks in. Let's call this multiplier 'W' and the bias 'b', leading us to:

\begin{align} \hat Y = W^T.X + b \tag{2.1} \end{align}

Why the transpose on W? X is a vector of size 'n x 1', so 'W' needs to be '1 x n' so they can be matrix multiplied; it's standard to use column vectors in machine learning. Coming back to our equation: since we are dealing with a binary output, and the equation above is that of a (linear) regression, in order to turn this into a value between zero and one let's wrap it in a function called the sigmoid function, denoted by $\sigma$. Thus our equation is

\begin{align} Output~ = \hat Y = \sigma{(W^T.X + b)} \tag{2.2} \end{align}

This is what the sigmoid function looks like (plotted in R):

```r
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

x <- seq(-5, 5, 0.01)
plot(x, sigmoid(x), type = "l", col = "blue")  # smooth S-shaped curve
```

And the equation of the sigmoid is $\frac{1}{1+e^{-z}}$, where we call $(W^T.X + b)$ 'Z'. How does all of this help us? When 'Z' is a large positive value the sigmoid approaches 1, since $e^{-\infty}$ is zero. When 'Z' is zero the fraction becomes $\frac{1}{2}$, since $e^{0}$ is 1. And when 'Z' is a large negative value the term tends to '0'. Hence the S shape, and our output runs from 0 to 1.

## 2.1 Loss function of Logistic Regression

In order for us to build/train a model we need to define a 'loss function'. What's a loss function? Simply put, it's a function that measures how well our model is doing on a given data point.

However, we first need to define the loss function 'L'. Recall that the model is trained on 'm' training examples, and for each of these points we want the prediction to match the label:

$(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}) ~we~want~ \hat{y}^{(i)} \approx y^{(i)}$

Now we could define the loss as the 'squared loss'; however, for logistic regression that makes the optimization problem of finding 'W' and 'b' non-convex, so gradient descent could get stuck in local optima. The takeaway is that this function 'L', called the loss function, is something you need to define to measure how good our output $\hat{Y}$ is when the true label is 'Y'. In our case the loss function will be the following:

$L(\hat{Y},Y)~= - (Y \log(\hat{Y}) + (1-Y) \log(1-\hat{Y}))$

Why does this loss function make sense?

• Case 1: 'Y' = 1. Then the loss function is

$L(\hat{Y},1)~= - \log(\hat{Y})$ This implies that if 'Y' = 1 you want $\log(\hat{Y})$ to be as large as possible; since $\hat{Y}$ is the output of a sigmoid and so bounded above by 1, this corresponds to $\hat{Y}$ being as close to 1 as possible.

• Case 2: 'Y' = 0. Then the loss function is $L(\hat{Y},0)~= -\log(1-\hat{Y})$ This implies that if 'Y' = 0 you want $\log(1 - \hat{Y})$ to be as large as possible; since $\hat{Y}$ is the output of a sigmoid and so bounded below by 0, this corresponds to $\hat{Y}$ being as close to 0 as possible.
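A quick numeric check of both cases (a Python sketch; the helper name `logistic_loss` is mine):

```python
import numpy as np

def logistic_loss(y_hat, y):
    """L(y_hat, y) = -(y*log(y_hat) + (1 - y)*log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Case 1 (y = 1): the loss shrinks as y_hat approaches 1.
print(logistic_loss(0.9, 1), logistic_loss(0.1, 1))
# Case 2 (y = 0): the loss shrinks as y_hat approaches 0.
print(logistic_loss(0.1, 0), logistic_loss(0.9, 0))
```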

## 2.2 Loss function over all cases or Cost function

In the above case we applied the loss function to one single training example, but the loss needs to be aggregated over the whole data set; this is called the 'cost function', often denoted by 'J', where J is given by the following equation.

$Cost~function~J(w,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{Y}^{(i)},Y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left( Y^{(i)} \log(\hat{Y}^{(i)}) + (1-Y^{(i)}) \log(1-\hat{Y}^{(i)}) \right)$
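In code the cost is just the mean of the per-example losses (a NumPy sketch; the predictions below are made up):

```python
import numpy as np

def cost(y_hat, y):
    """J(w, b): average cross-entropy loss over all m training examples."""
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

y = np.array([1, 0, 1, 1])              # true labels
y_hat = np.array([0.9, 0.2, 0.8, 0.6])  # model outputs
print(cost(y_hat, y))
```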

It turns out that logistic regression can be viewed as a very very small neural network.

## 2.3 Gradient Descent

We want to minimize our cost function 'J(w,b)', which as one can see is a function of the two parameters 'w' and 'b'.

In the figure above we can start anywhere, with randomly chosen 'b' and 'w', and follow the path in the direction of the steepest descent downhill. This is essentially what gradient descent does.

Gradient descent repeats the following updates until 'w' and 'b' stop changing:

$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$

$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$

where $\alpha$ is the learning rate.
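The update rule is easiest to see on a one-dimensional toy function (my own example, not from the course):

```python
def gradient_descent(dJ_dw, w0, alpha=0.1, steps=1000):
    """Repeatedly apply w := w - alpha * dJ/dw."""
    w = w0
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w

# Minimise J(w) = (w - 3)^2, whose derivative is 2*(w - 3); the minimum is at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # converges to 3.0
```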

## 2.4 Applying gradient descent to logistic regression

Let's recall that, for an example with two features $x_1$ and $x_2$,

$z = w^Tx + b = x_1 w_1 + x_2 w_2 + b$

$\hat{Y} = a = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-w^Tx-b}}$

$L(a,y) = -(y\log(a) + (1-y) \log(1-a))$

$\frac{dL(a,y)}{da} = - (\frac{y}{a} - \frac{(1-y)}{(1-a)}) = - \frac{y}{a} + \frac{(1-y)}{(1-a)}$

$\frac{dL(a,y)}{dz} = \frac{dL(a,y)}{da} \frac{da}{dz}$ Let's find $\frac{da}{dz}$:

$\frac{da}{dz} = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = a(1-a)$

$\frac{dL(a,y)}{dz} = \frac{dL(a,y)}{da} \frac{da}{dz} = \left(- \frac{y}{a} + \frac{(1-y)}{(1-a)}\right) \times a(1-a) = a-y$

$\frac{dL(a,y)}{dw_{1}} = \frac{dL(a,y)}{dz} \frac{dz}{dw_1} = x_1 (a-y)$

$\frac{dL(a,y)}{dw_{2}} = \frac{dL(a,y)}{dz} \frac{dz}{dw_2} = x_2 (a-y)$

$\frac{dL(a,y)}{db} = \frac{dL(a,y)}{dz} \frac{dz}{db} = (a-y)$
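The key result $\frac{dL(a,y)}{dz} = a - y$ can be sanity-checked with a finite-difference approximation (a Python sketch):

```python
import numpy as np

def loss_of_z(z, y):
    a = 1 / (1 + np.exp(-z))  # a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (loss_of_z(z + eps, y) - loss_of_z(z - eps, y)) / (2 * eps)
analytic = 1 / (1 + np.exp(-z)) - y  # a - y
print(numeric, analytic)  # the two should agree closely
```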

## 2.5 Final Results

$w_1 = w_1 - \alpha x_1[a-y]$

$w_2 = w_2 - \alpha x_2[a-y]$

$b = b - \alpha [a-y]$
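Putting the forward pass and these updates together gives a minimal training loop; the toy data, learning rate and iteration count below are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Four examples with two features; label is 1 when x1 + x2 > 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    a = sigmoid(X @ w + b)          # predictions for all examples
    dz = a - y                      # dL/dz per example
    w -= alpha * X.T @ dz / len(y)  # w_j := w_j - alpha * mean(x_j * (a - y))
    b -= alpha * dz.mean()          # b := b - alpha * mean(a - y)

print(sigmoid(X @ w + b).round(2))  # predictions move toward y
```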

# 3 Building a Neural Network

From the above we got an intuitive idea of a neural network: for a given input we take a weighted sum of the inputs, add a bias, pass the result through an activation function and produce an output.

However, the real power of neural networks comes when we add layers: essentially the unit above is what gets repeated, with different inputs, activation functions and layers, as in the image below.

For 'm' training examples we get the same equations applied to each example, which in practice are vectorized across all examples at once.

## 3.1 Activation function

Among the many activation functions available, the default today is the ReLU function, or even leaky ReLU. Sigmoid is hardly ever used in hidden layers because of the vanishing gradient problem (its slope approaches 0 at the extreme ends); however, it can be used in the final output layer if the output happens to be a binary variable, since the function varies from 0 to 1. The tanh function is usually a better substitute for sigmoid in hidden layers since it varies from -1 to 1; this works in the same way as centering the data around '0'.
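For reference, the four activations mentioned above in one place (a Python sketch):

```python
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))          # output in (0, 1)
def tanh(z):       return np.tanh(z)                    # output in (-1, 1), centred at 0
def relu(z):       return np.maximum(0.0, z)            # today's default for hidden layers
def leaky_relu(z): return np.where(z > 0, z, 0.01 * z)  # small negative slope keeps gradients alive

z = np.array([-3.0, 0.0, 3.0])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```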

## 3.2 The need for non-linear activation function

Let's assume that the activation function $g^{[1]}$, say 'tanh' in our network, is replaced by a linear (identity) function. Then we get the following equations.

• layer one (hidden layer) $z^{[1]} = W^{[1]} x + b^{[1]}$

• The activation is now linear, which leads to $a^{[1]} = g^{[1]}(z^{[1]}) = z^{[1]} = W^{[1]} x + b^{[1]}$

• Second layer (output layer) $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]}$
• Finally we get

$z^{[2]} = W^{[2]} W^{[1]} x + W^{[2]}b^{[1]} + b^{[2]} = W^{'}x+ b^{'}$

$W^{'} = W^{[2]} W^{[1]}$

$b^{'} = W^{[2]}b^{[1]} + b^{[2]}$

As we can see, the final output is nothing but a linear combination of the inputs, so the hidden layers are not doing anything here. Moreover, if the output layer uses sigmoid and the hidden layers use linear activations, we basically get 'logistic regression'.
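This collapse is easy to verify numerically, with random matrices of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))
x = rng.normal(size=(3, 1))

# Two "layers" with identity activations...
z2 = W2 @ (W1 @ x + b1) + b2
# ...equal a single linear layer with W' = W2 W1 and b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
print(np.allclose(z2, W_prime @ x + b_prime))  # True
```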

So when are linear activation functions useful? One possible case is regression: if the hidden layers use 'tanh'/'ReLU' for activation and the output layer uses 'g(z) = z', then we can perform linear regression.

## 3.3 Derivatives of activation functions

### 3.3.1 Sigmoid Function

$a = g(z) = \frac{1}{1+e^{-z}} \\ g^{'}(z) = \frac{d}{dz} g(z) = \frac{1}{1+e^{-z}} (1-\frac{1}{1+e^{-z}}) = g(z) (1-g(z))$

### 3.3.2 Tanh Function

$a = g(z) = \frac{e^z - e^{-z}}{e^z+e^{-z}} \\ g^{'}(z) = \frac{d}{dz} g(z) = 1 - (tanh(z))^2= 1 - g(z)^2 = 1 - a^2$

### 3.3.3 ReLU Function

$$a = g(z) = max(0,z) \\ g^{'}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ undefined & \text{if } z = 0 \\ \end{cases}$$

### 3.3.4 Leaky ReLU Function

$$a = g(z) = max(0.01z, z) \\ g^{'}(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \\ undefined & \text{if } z = 0 \\ \end{cases}$$
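Each of these derivative formulas can be checked against a finite-difference approximation (a Python sketch for sigmoid and tanh):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.4, 1e-6

# Sigmoid: g'(z) = g(z) * (1 - g(z))
num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(num_sig - sigmoid(z) * (1 - sigmoid(z))) < 1e-9

# Tanh: g'(z) = 1 - tanh(z)^2
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
assert abs(num_tanh - (1 - np.tanh(z) ** 2)) < 1e-9

print("derivative checks passed")
```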

## 3.4 Gradient Descent for Neural Networks

Assume a two-layer network: one hidden layer and one output layer. Let the output layer's activation function be sigmoid.

### 3.4.1 Forward Propagation

• First layer (hidden layer)

• W is the weight matrix
• b is the bias vector
• X is the inputs

$z^{[1]} = W^{[1]}X + b^{[1]}$

• Output from the hidden layer

• g is the activation function

$A^{[1]} = g^{[1]} (z^{[1]})$

• Input to the output layer

$z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$

• Output from the output layer

$A^{[2]} = g^{[2]} (z^{[2]}) = \sigma(z^{[2]})$

• In general, for an 'l'-layer network, the forward propagation is given by:

$z^{[l]} = W^{[l]} \cdot a^{[l-1]} + b^{[l]}$ $a^{[l]} = g^{[l]}(z^{[l]})$
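The general recurrence above is just a loop over the layers (a sketch; the layer sizes and activation choices here are illustrative):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

def forward(x, params):
    """params is a list of (W, b, g) per layer; returns the final activation a^[L]."""
    a = x
    for W, b, g in params:
        z = W @ a + b  # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)       # a^[l] = g^[l](z^[l])
    return a

rng = np.random.default_rng(1)
params = [
    (rng.normal(size=(4, 3)) * 0.01, np.zeros((4, 1)), relu),     # hidden layer
    (rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1)), sigmoid),  # output layer
]
x = rng.normal(size=(3, 1))
print(forward(x, params))  # one probability-like output in (0, 1)
```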

### 3.4.2 Backward Propagation

$\frac{dL}{dz^{[2]}} = dz^{[2]} = a^{[2]} - y \\$ $\frac{dL}{dw^{[2]}} = dw^{[2]} = dz^{[2]} a^{[1]T} =(a^{[2]} - y) a^{[1]T} \\$ $\frac{dL}{db^{[2]}} = db^{[2]} = dz^{[2]} = (a^{[2]} - y) \\$

$\frac{dL}{dz^{[1]}} = dz^{[1]} = w^{[2]T} dz^{[2]} * g^{[1]'}(z^{[1]}) = w^{[2]T} (a^{[2]} - y) * g^{[1]'}(z^{[1]}) \\$ $\frac{dL}{dw^{[1]}} = dw^{[1]} = dz^{[1]} x^{T} \\$ $\frac{dL}{db^{[1]}} = db^{[1]} = dz^{[1]} \\$
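A numerical version of these backprop equations for a single example, with a finite-difference check on one weight (the tanh hidden layer and all numbers are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.5, np.zeros((1, 1))
x, y = rng.normal(size=(3, 1)), 1.0

# Forward pass
a1 = np.tanh(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

# Backward pass
dz2 = a2 - y                      # dL/dz^[2]
dW2 = dz2 @ a1.T                  # dL/dW^[2]
db2 = dz2                         # dL/db^[2]
dz1 = W2.T @ dz2 * (1 - a1 ** 2)  # dL/dz^[1], with tanh'(z) = 1 - a^2
dW1 = dz1 @ x.T                   # dL/dW^[1]
db1 = dz1                         # dL/db^[1]

# Finite-difference check on W1[0, 0]
def loss(W1_):
    a2_ = sigmoid(W2 @ np.tanh(W1_ @ x + b1) + b2)
    return (-(y * np.log(a2_) + (1 - y) * np.log(1 - a2_))).item()

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
print(numeric, dW1[0, 0])  # the two should agree closely
```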

## 3.5 Weights Initialization

It is important that the weights used in neural networks be initialized randomly and differently (not all zeros), otherwise all the hidden nodes will learn the same features. This symmetry is never broken, so running gradient descent for however long will not help either.

This problem does not apply to the bias vector: the biases can all be the same, even zero. Also keep in mind that the weights should never be too large or too small. Too-large weights saturate the activations (for sigmoid/tanh the gradient there is near zero), while too-small weights make learning too slow.

In practice we multiply the randomly initialized weights with a constant number like so:

```r
weights <- rnorm(n = 10, mean = 0, sd = 1) * 0.01
weights
##  [1]  0.0113896425 -0.0140915810 -0.0060776724 -0.0088922880 -0.0028578727
##  [6]  0.0002917322 -0.0041145845 -0.0149229498  0.0125772361  0.0095581917
```

The constant 0.01 is fine for shallow neural networks, but for deeper ones there are better scaling factors.
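One popular choice for deeper ReLU networks, not derived above, is He initialization, which scales the weights by $\sqrt{2/n_{in}}$ instead of a fixed 0.01 (a Python sketch):

```python
import numpy as np

def init_layer(n_out, n_in, rng):
    """He initialization: weights scaled by sqrt(2 / n_in); biases start at zero."""
    W = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(4, 100, rng)
print(W.std())  # roughly sqrt(2/100), i.e. about 0.14
```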

## 3.6 Deep Layers Learning

The idea here is that the first few layers of a neural network learn simple features ("edge detection") while the later layers learn more complex features ("face recognition"). This is the basis for deep neural networks. In deep learning we increase the number of layers rather than the number of nodes per layer, because there are functions a deep network can compute with relatively few units that a shallow network would need exponentially many hidden units to represent.

However, the number of layers is actually a hyper-parameter that we need to tune.

### 3.6.1 Hyper Parameters

Unlike traditional machine learning, the number of parameters in deep learning is much higher. There are also additional knobs called hyper-parameters. The classification is as follows:

• Parameters (learned during training)

• Weights
• Biases (the outputs of each layer are computed from these rather than being learned directly)

• Hyper-parameters

• Number of nodes
• Number of hidden layers
• Learning rate
• Momentum
• Activation functions

### 3.6.2 Validation of Model Performance

Another deviation from traditional machine learning is the train/test or train/validation/test split. In the world of big data and deep learning, with millions of data points, it's not rare to use 98% of the data for training and 1% each for validation and test. This can go even higher, to almost 99.5% of the data used for training in some cases. Thus, depending on the data-set size, the split ratio varies from the traditional 70/15/15 to an extreme such as 99/0.5/0.5.

### 3.6.3 Bias and Variance of the model

Imagine we build a model which predicts the class of an animal, cat or dog. If our model always predicts that a given image is a cat, then it is a 'high bias' model. If our model has almost 0% error in training but a very high error on the test/validation set (say 30%), then we say the model has 'high variance'. Ideally the models we build should have low bias and low variance.

To summarize: from the 'training error' we come to know whether or not our model is 'biased', while from the gap between it and the 'test error' we come to know whether the model has 'high variance'.

In traditional machine learning, bias and variance were subject to the classic 'bias-variance trade-off', where a more complex model increased the variance but reduced the bias; in deep learning, however, having a bigger network lets us reach the ideal low-bias, low-variance state.

### 3.6.4 Recipe for deep learning model building

For a given model do the following:

• Does the model have high bias (high training error)? If yes:

• Build a bigger network
• Increase the training time
• Try a different neural network architecture

• After reducing the training error/high bias, look at the variance problem: is the variance high?