Deep Learning Specialization Review
A review of topics from the deeplearning.ai Specialization on Coursera with additional commentary
Overview
The purpose of this article is to show what I learned from the Deep Learning Specialization on Coursera by Deeplearning.ai and to give additional commentary on these topics. It is not a comprehensive list of everything included in the courses. It is intended to explain essential subjects in simple terms and will not be rigorous in any way. I hope that readers who have had difficulty finding the intuition behind these topics will come away feeling more comfortable with them and with a better overall sense of data science-related topics.
Neural Networks and Deep Learning
What is a neural network? A neural network is a function that takes data features as input and outputs a prediction. More precisely, training a neural network means finding, among the functions the network can represent, the one that best maps the inputs to the target variable. When I hear the term “neural network,” I think of a standard multi-layer perceptron network. That term may strike fear into some people, but the concept is straightforward. A multi-layer perceptron is a fancy way of saying a series of linear regressions with nonlinearities in between. This neural network has only a few parts, namely linear layers and activation functions. A linear layer is just like an ordinary linear regression. It takes the inputs and outputs a weighted combination of them. If a neural network consisted only of linear layers without any activation functions, it would only learn a weighted combination of the inputs, the same way a linear regression does. Naturally, activation functions introduce nonlinearities so that the neural network can learn more complex processes.
Activation functions have an excellent idea behind them. If we think of each node of a neural network as a neuron, similar to the ones within a brain, an activation function determines how much the given node will activate based on its inputs. A neuron in a neural network has three parts: a weighted sum of its inputs, a bias, and an activation function. The bias sets the threshold at which the neuron will activate, given its inputs and weights, and is another learned variable.
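To make this concrete, here is a minimal NumPy sketch of a single neuron and a small multi-layer perceptron forward pass; the layer sizes, tanh activation, and random weights are arbitrary choices for illustration rather than anything prescribed by the course.

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: a weighted sum of its inputs plus a bias, passed through an activation."""
    return np.tanh(np.dot(w, x) + b)

def mlp_forward(x, layers):
    """A multi-layer perceptron: a series of linear layers with nonlinearities in between."""
    a = x
    for W, b in layers:
        a = np.tanh(W @ a + b)   # linear layer followed by an activation function
    return a

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # 3 input features -> 4 hidden neurons
          (rng.standard_normal((1, 4)), np.zeros(1))]   # 4 hidden neurons -> 1 prediction
print(mlp_forward(rng.standard_normal(3), layers))
```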
How does a neural network learn? While the terms “forward propagation” and “backward propagation” may sound intimidating, the ideas behind them are also simple. A forward pass of a neural network, or forward propagation, computes the network’s prediction for a given input. To make this prediction more accurate, you can take the distance from the prediction to the actual value of the data. Then, working backward from the end of the network, each part is assigned a share of that error in proportion to how much it activated. This shows which parts of the neural network were responsible for the prediction error so that they can be adjusted to be more accurate. Typically, these adjustments are scaled down by multiplying them by the learning rate. This hyperparameter is responsible for the step size of changes to the other variables of a neural network. In total, a gradient descent step is like taking a small step in the right direction through a linear approximation, although this is not exactly true because of the nonlinearities.
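As a rough sketch of a forward pass, an error measurement, and an update scaled by the learning rate, here is gradient descent fitting a single weight; the data and learning rate are made up for illustration.

```python
import numpy as np

# Toy data: y is roughly 3 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.9, 9.2, 11.8])

w = 0.0               # a single trainable weight
learning_rate = 0.01  # hyperparameter controlling the step size

for step in range(200):
    pred = w * x                     # forward pass: compute the prediction
    error = pred - y                 # distance from the prediction to the actual value
    grad = 2 * np.mean(error * x)    # how much the error changes as w changes
    w -= learning_rate * grad        # small step in the right direction

print(w)  # approaches roughly 3
```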
To train a neural network in this way, you need to have data that contains the target variable. Machine learning in this context is called supervised learning. It is only possible when data is accurately labeled, and the model can only achieve predictive power as accurate as the training data. Because of this, supervised learning alone would not be enough to create a superhuman artificial intelligence. However, it is handy when data comes from a source other than manual labeling, such as in financial applications or in other areas where target data is already labeled naturally.
More generally, deep learning models are trained through gradient descent. A gradient, in simpler terms, is the amount that a variable changes with respect to the other variables. To find the gradient of a variable within a neural network with respect to the prediction error, you take the gradient from the next layer up and multiply it by the activation of the current layer. This way, we can find how much each variable contributed to the activation in the next layer. Then, we attribute the error term to each part of the current layer based on its activations. This backward propagation of gradients is an application of the chain rule from calculus. If we think of the error as a function of the network’s weights, then the goal of training is to find a minimum point of that function, which is exactly what gradient descent does. The error is a function of the distance between the neural network prediction and the target variable, so its slope, or gradient, is its first derivative with respect to the weights. If the error then starts to increase, we could say that the point where its slope changes from negative to positive is, at least, a local minimum of the error.
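Here is a minimal sketch of the chain rule at work in a two-layer network with made-up numbers; each gradient is the gradient from the layer above multiplied by the local derivative of the current layer.

```python
import numpy as np

# A tiny two-layer network: x -> w1 -> tanh -> w2 -> prediction.
x, target = 0.5, 1.0
w1, w2 = 0.3, -0.2

# Forward pass.
z1 = w1 * x
a1 = np.tanh(z1)
pred = w2 * a1
loss = (pred - target) ** 2

# Backward pass: apply the chain rule layer by layer, from the end to the start.
dloss_dpred = 2 * (pred - target)
dloss_dw2 = dloss_dpred * a1                        # gradient for the last layer's weight
dloss_da1 = dloss_dpred * w2                        # error passed back to the earlier layer
dloss_dw1 = dloss_da1 * (1 - np.tanh(z1) ** 2) * x  # chain rule through tanh and the linear layer

print(dloss_dw1, dloss_dw2)
```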
The distance measurement of the error term may use a variety of functions, called loss functions. These arise from two main applications of supervised learning. There are classification tasks that attempt to predict categorical data and regression tasks that model a continuous variable. In a regression task, the loss function may simply be the distance formula, that is, the squared difference between the prediction and the target. In classification tasks, the distance formula would give the error a non-convex surface because the target variable is not continuous. The discontinuity in the target variable creates large and sudden jumps in the gradient, which gives the error multiple local minima that the weights of the neural network can get stuck in.
Classification tasks use cross-entropy loss to rectify this issue. In essence, the cross-entropy loss is the negative logarithm of the prediction when the target variable is of the given category and the negative logarithm of one minus the prediction when the target variable is not of the given class. The magic of cross-entropy loss comes from using the certainty of the prediction, which is continuous, instead of the distance of the final prediction. This regression on the confidence of the prediction is called logistic regression. However, a loss function can be any function, and others are useful in practice.
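A minimal sketch of the two losses discussed above; the small epsilon clipping is just a numerical safeguard I added to avoid taking the logarithm of zero.

```python
import numpy as np

def mse_loss(pred, target):
    """Squared-distance loss for regression tasks."""
    return np.mean((pred - target) ** 2)

def binary_cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy loss for binary classification.

    pred is the predicted probability of the positive class; target is 0 or 1.
    The loss is -log(pred) when the target is 1 and -log(1 - pred) when it is 0.
    """
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))  # confident and correct: low loss
print(binary_cross_entropy(np.array([0.1, 0.8]), np.array([1, 0])))  # confident and wrong: high loss
```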
Structuring Machine Learning Projects
Machine learning models optimize for a loss function, but in a real-world scenario, their applications may have different objectives. For this reason, it is vital to choose appropriate evaluation metrics. It is challenging to optimize these metrics directly, so the loss function is often not the same as the evaluation metric; a data scientist should carefully choose the loss function that best fits the evaluation metric. For example, a classification model that attempts to predict whether someone has an illness may achieve very high accuracy even by randomly guessing if the disease is rare. Therefore, it does not make sense to evaluate the model only on accuracy because accuracy says little about its predictive power. In this case, it may be critical to optimize the model to minimize false negatives if the illness is severe or life-threatening. While the precision of the model may not be as high, there will be a higher recall. It would be unmanageable to optimize for false positives and false negatives separately, so in many cases the model is evaluated on a combined metric, such as the F1 score.
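A minimal sketch of these metrics using the rare-illness example; the labels are made up for illustration.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = has the illness)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Predicting "healthy" for everyone is 90% accurate here but catches no cases at all.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.zeros(10, dtype=int)
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```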
To fairly evaluate a machine learning model, some of the data is kept outside of the set that the model trains on, namely a validation set. Then, several models can be trained on the training data, and the model that performs best on the validation set is chosen and evaluated on the test set. The idea behind holding out two different groups outside the training data is that the model fits the training set: it will perform better on this data because gradient descent finds the weights that maximize performance on the training set. Then, because we choose the model that performs best on the validation set, the validation score is subject to a similar selection bias. The purpose of the test set is to fairly evaluate the chosen model on data that it has not seen and that it has not been selected to perform well on. This process is quintessential in model evaluation because it is the only way to remove these biases and accurately quantify how general the model’s predictions will be on unseen data.
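A minimal sketch of holding out validation and test sets; the 70/15/15 split fractions are an arbitrary choice for illustration.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the sample indices and split them into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```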
There is an improvement that can be made to validation-set evaluation. The K-fold cross-validation technique is instrumental when there is not an abundance of data, or when the model can be retrained easily, to give more reliable information about the model’s performance on unseen data. In K-fold cross-validation, the training data is partitioned into K folds, and each fold in turn is used as the validation set while the model is retrained on the remaining folds. This way, the entire training set can be used for validation by holding out different parts and averaging the performance across the folds.
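A minimal sketch of K-fold cross-validation; `train_and_score` is a hypothetical callback standing in for whatever training routine is being evaluated.

```python
import numpy as np

def k_fold_score(X, y, k, train_and_score, seed=0):
    """Each of the k folds takes a turn as the validation set; scores are averaged."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return np.mean(scores)
```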
For completeness, I should mention that a machine learning model does not always perform as well on unseen data as on the data it was trained on. When a model is not performant on either the training set or the validation set, the model suffers from high bias and is not complex enough to model the underlying process. Suppose the model performs well on data within the training set but poorly on data within the validation set. In that case, the model suffers from high variance: it has enough complexity to memorize the training set. This tension over model complexity is the bias-variance trade-off. For this reason, it is paramount that models are quickly prototyped and iterated upon, as a simpler model may have problems with high bias, while a model that is too complex for the given task will suffer from high variance, especially if the data is limited.
There are also ways to combat a limited amount of data. Especially when another model is similar in terms of its objective, transfer learning will be beneficial. The idea behind transfer learning is to use weights from a model trained on another task with similar features. The weights of this network will be closer to the optimal general solution for our problem. Then we can fine-tune the model by training it on our smaller set of data, saving time and computing resources by avoiding training another model from scratch, as many of the network’s weights will already be within a closer neighborhood of their optimal values.
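A minimal Keras-style sketch of transfer learning, assuming TensorFlow is available; the choice of MobileNetV2, the input size, and the single-unit head are arbitrary choices for illustration.

```python
import tensorflow as tf

# Reuse weights from a network trained on ImageNet as a starting point.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(160, 160, 3))
base.trainable = False  # freeze the pretrained weights; only the new head trains at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new output layer for our smaller task
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(our_small_dataset, epochs=5)  # fine-tune on our own data (hypothetical dataset)
```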
Hyperparameter Tuning, Regularization, and Optimization
Before I continue on the topic of the bias-variance trade-off and the process of designing machine learning projects, I should explain the difference between a hyperparameter and a parameter in the context of machine learning. A hyperparameter is one of a model’s characteristics, such as the number of layers, the number of neurons in a layer, the learning rate, or anything else that is not a trainable variable in the model. A parameter is the opposite of that. It is a trainable variable within the model.
Models with a relatively large number of parameters are more subject to the problem of high variance during evaluation due to their increased complexity. These models may need regularization to make them viable for real applications and improve their generality. Regularization can come as an addition to the error term equal to the sum of the absolute values of the model’s weights, or the sum of their squares, or a combination of the two, multiplied by a constant that is another hyperparameter of the model. These regularization techniques are lasso, ridge, and elastic net regression, respectively.
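A minimal sketch of adding these penalties to a loss; the strengths `l1` and `l2` are the extra hyperparameters mentioned above.

```python
import numpy as np

def regularized_loss(base_loss, weights, l1=0.0, l2=0.0):
    """Add lasso (L1), ridge (L2), or elastic net (both) penalties to a base loss.

    `weights` is a list of the model's weight arrays.
    """
    l1_penalty = sum(np.sum(np.abs(w)) for w in weights)
    l2_penalty = sum(np.sum(w ** 2) for w in weights)
    return base_loss + l1 * l1_penalty + l2 * l2_penalty
```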
Regularization aims to decrease the individual contributions of neurons so that a network does not learn to rely too heavily on any one neuron, increasing the sparsity of activations within the network. A common and personal favorite way to increase the sparsity of a neural network is to use dropout connections within the model. During training, a dropout connection takes the previous layer’s outputs and randomly drops, or sets to zero, a portion of the connections before feeding them into the next layer. In effect, dropout forces the next layer to learn to predict with incomplete information and not rely too much on one or a small set of inputs.
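A minimal sketch of (inverted) dropout as I understand it from the course; `keep_prob` is another hyperparameter.

```python
import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    """Randomly zero a portion of the previous layer's outputs during training."""
    if not training:
        return activations  # no dropout when making real predictions
    mask = np.random.default_rng().random(activations.shape) < keep_prob
    return activations * mask / keep_prob  # rescale so the expected activation stays the same
```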
Regularization relates to the bias-variance trade-off. Increasing the regularization of a network may increase the model’s bias due to the restricted flow of information; however, this can help the model predict data that it has not been trained on if it was overfitting the training data. Overfitting is closely related to the problem of high variance, where the model is simply trying to memorize the training data rather than learn the actual function behind the process. Underfitting is the opposite problem and is closely related to the problem of high bias. If a model is underfitting, it cannot model the process because it is too simple for the task. In this case, it is wise to decrease the regularization or possibly exclude it entirely.
To search for hyperparameters, you can train several models in parallel, each with a different combination of hyperparameters, in a grid search. In a grid search, a grid is constructed with an axis for each hyperparameter to be varied. Usually, it is best to put these axes on a logarithmic scale so that hyperparameters that can vary by several orders of magnitude will be optimized much faster than if they were explored on a linear scale. Grid search helps optimize hyperparameters because different values can interact, so trying combinations of parameters at different scales is more thorough than varying one hyperparameter at a time.
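A minimal sketch of a grid search over two log-scaled hyperparameters; `train_and_validate` is a hypothetical routine standing in for training a model and returning its validation score.

```python
import itertools
import numpy as np

# Axes of the grid, spaced logarithmically because these values span orders of magnitude.
learning_rates = np.logspace(-4, -1, 4)   # 1e-4, 1e-3, 1e-2, 1e-1
l2_strengths = np.logspace(-5, -2, 4)

best = None
for lr, l2 in itertools.product(learning_rates, l2_strengths):
    score = train_and_validate(lr=lr, l2=l2)  # hypothetical training routine
    if best is None or score > best[0]:
        best = (score, lr, l2)
print(best)
```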
Hyperparameters are only one piece of the puzzle, however. In a neural network, the parameters are essential too. They must have a weight initialization scheme that allows each part of the neural net to say something different about the input data. Because gradient descent relies on local linear approximations, neurons that start with identical weights receive identical gradients and remain identical throughout training. Weights must therefore be initialized asymmetrically, typically at random.
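A minimal sketch of asymmetric (random) initialization; the He scaling shown here is one common choice for ReLU layers, not the only option.

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """Random weights scaled by sqrt(2 / n_in), so each neuron starts out different."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Initializing every weight to the same constant would keep the neurons identical forever;
# random initialization breaks that symmetry.
W = he_init(256, 128)
```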
A further nuance of deep learning and the gradient is that input data should be on the same measurement scale. If we think of the weight on each input to a neuron as the importance of that input, this makes sense: features on an equal scale make it easier to determine the weights. This process is called normalization, and it helps the neural network to learn by smoothing out the surface of the gradients. It is also helpful to normalize the outputs of layers inside the neural network for the same reason, with either batch normalization or layer normalization, so that gradients between layers are more uniform with respect to the inputs.
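A minimal sketch of input normalization; the test set is scaled with the training set’s statistics so that no information leaks between them.

```python
import numpy as np

def normalize(X_train, X_test):
    """Scale each feature to zero mean and unit variance using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8  # small constant avoids division by zero
    return (X_train - mean) / std, (X_test - mean) / std
```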
Other issues with the gradient calculation arise from the continual multiplication of gradients as they propagate backward through the network. At each layer, the gradients are calculated by multiplying the following layer’s gradients and the activation in the current layer. If these activations are less than one consistently, then the gradient will diminish as it reaches the first layers of the network. Likewise, with activations greater than one, gradients will snowball. This problem is called the exploding or vanishing gradient problem. It can be remedied by clipping gradients within a specified range of values.
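A minimal sketch of clipping gradients by their combined norm; the threshold of 5.0 is an arbitrary example value.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined norm stays within max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```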
Mini-batch gradient descent is used for large datasets or to train neural networks more quickly. Mini-batch gradient descent is the process of computing the loss on a smaller subset of the data at a time and training the network on that batch before moving on to the next batch. This makes it possible to quickly take more of these small gradient steps instead of computing the total error on every example for each update to the network’s weights. Optimizers are used to further decrease the time it takes to train a neural network. Optimizers take the previous gradients and the current gradient and create a more direct path to the minimum in the loss landscape. These can use exponentially weighted moving averages, like Adam, the adaptive moment estimation optimizer, or any other function, including other neural networks.
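A minimal sketch of a mini-batch generator and a single Adam update; the batch size and Adam constants shown are the usual defaults, used here only for illustration.

```python
import numpy as np

def mini_batches(X, y, batch_size=64, seed=0):
    """Yield shuffled mini-batches so the weights are updated after each small subset."""
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially weighted averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```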
When the error term fails to decrease further, the learning rate can be reduced so that the neural network’s weights take smaller steps. This typically happens towards the end of training and is called learning rate decay. It allows the weights to become more precise by slowing their rate of change, but it is equally important to choose a larger learning rate at the beginning of training so that the neural network trains quickly.
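One common decay schedule, the one I remember from the course, is sketched below; the initial rate and decay constant are example values.

```python
def decayed_learning_rate(initial_lr, epoch, decay_rate=0.1):
    """Shrink the learning rate as training progresses: lr = lr0 / (1 + decay_rate * epoch)."""
    return initial_lr / (1 + decay_rate * epoch)

for epoch in [0, 10, 50, 100]:
    print(epoch, decayed_learning_rate(0.1, epoch))
```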
Convolutional Neural Networks
Convolutional neural networks work similarly to a multi-layer perceptron network. However, they are specialized for data where features oriented closer to each other are more relevant, such as pixels in an image or a time-series application. They usually consist of a few unique kinds of layers. First is the convolutional layer, whose filters, usually in a grid, are tiled over the input volume to identify features within it. Convolutional neural networks often include pooling layers after sets of convolutional layers. Pooling layers reduce the amount of computation, limit the number of parameters in the model, and enhance numerical stability during training. A pooling layer outputs the maximum or average activation of each region that it pools.
An activation function most commonly used in convolutional neural networks is the rectified linear unit or ReLU. The ReLU function is the max of zero and the neuron’s activation. It works well in practice and has an excellent intuitive explanation in terms of computer vision. For example, imagine a neural network that tries to identify if a cat is in a photograph. It would not make much sense to say that there is a negative amount of a feature in the image, whether that is a leg, door, or anything else.
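To tie the last two paragraphs together, here is a minimal NumPy sketch of a single filter tiled over an image, followed by ReLU and max pooling; the image, filter, and sizes are toy values for illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: the max of zero and the activation."""
    return np.maximum(0, x)

def conv2d_single(image, kernel):
    """Slide one filter over a 2-D input and record how strongly it matches each location."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the maximum activation in each pooling region, shrinking the feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    return feature_map[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((6, 6))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # a toy vertical-edge detector
print(max_pool(relu(conv2d_single(image, edge_filter))))
```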
Another essential part of many convolutional neural network architectures is the residual block. A residual block is a collection of layers packaged together that acts as one more performant layer. It creates connections between earlier and later layers in the neural network by adding the outputs from the previous block to the outputs of the following block. Residual connections also allow the gradients to reach the earlier parts of the network much more reliably and give the network a consistent flow of information to update its parameters.
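A minimal sketch of a residual block built from two small linear layers; the shapes are assumed square so the block’s input can be added directly to its output.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, b1, W2, b2):
    """Two layers whose output is added back to the block's input (a skip connection)."""
    out = relu(W1 @ x + b1)
    out = W2 @ out + b2
    return relu(out + x)  # the shortcut lets gradients and information skip the block
```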
Convolutional neural networks can also use other layers, such as a linear layer in the case of a classification task or key point regression. Key point regressions are used in filters for photographs or security applications. Other regressions can be used to generate a bounding box for the localization of objects within an image. Bounding box models sometimes incorrectly detect one object as several copies of the same object. However, the intersection over union, or IOU, metric can be helpful to penalize the model for doing this. IOU is a measure of the area shared between two bounding boxes relative to their combined area, and the optimal threshold for it is a common hyperparameter in this use case.
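A minimal sketch of the IOU calculation between two boxes given as (x1, y1, x2, y2) corners.

```python
def iou(box_a, box_b):
    """Area shared by two bounding boxes divided by the area of their union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```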
Another specific but common application of convolutional neural networks is style transfer. There is a loss function unique to this application that allows a convolutional neural network to learn the difference between the content versus the style of an image.
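As I recall, the style part of that loss compares Gram matrices of layer activations; here is a minimal sketch under that assumption, with activation shapes made up for illustration.

```python
import numpy as np

def gram_matrix(activations):
    """Correlations between filter channels; these capture the 'style' of an image."""
    channels = activations.reshape(-1, activations.shape[-1])  # (positions, channels)
    return channels.T @ channels

def style_cost(style_activations, generated_activations):
    """Squared distance between the Gram matrices of the style and generated images."""
    h, w, c = style_activations.shape
    gs = gram_matrix(style_activations)
    gg = gram_matrix(generated_activations)
    return np.sum((gs - gg) ** 2) / (4 * (h * w * c) ** 2)
```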
Sequence Models
There are recurrent neural networks for machine learning applications that deal with sequences, like natural language processing, time-series forecasting, or data with a variable-length input or output. A recurrent neural network is just like a regular neural network, except that its layers share the same weights across time steps. The outputs of the neural network are fed back into it as inputs at the next step. These can include latent vectors that track the state of the process or information that will be useful in deciding the future outputs of the neural network.
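A minimal sketch of a recurrent layer reusing the same weights at every step and feeding its previous output back in.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step: combine the current input with the fed-back hidden state."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

def rnn_forward(xs, h0, Wx, Wh, b):
    """Run the same shared-weight step over a whole sequence of inputs."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, Wx, Wh, b)
        states.append(h)
    return states
```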
Until recently, there have been two main kinds of recurrent neural networks, namely the Long Short-Term Memory model, or LSTM, and the Gated Recurrent Unit model, or GRU. Both of these models combine linear layers with a hidden state. The hidden state tracks relevant information using a system of gates, small learned neurons that decide when to store new information, use held information, or delete the stored information.
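A minimal sketch of a GRU step; biases are omitted for brevity, and the weight matrices are assumed to act on the concatenated hidden state and input.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: gates decide when to overwrite the hidden state and when to keep it."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)  # update gate: store new information?
    r = sigmoid(Wr @ concat)  # reset gate: use or ignore the held information?
    h_candidate = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_candidate  # blend the old state with the candidate
```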
Due to the recurrent nature of a recurrent neural network, the exploding or vanishing gradient problem is especially prevalent inside these models. Because there is only one or possibly a small set of layers repeatedly reused in the model, gradients tend towards zero or infinity if the gradient is multiplied repeatedly by a value less than one or greater than one. For this reason, gradient clipping is an integral part of the process of training a recurrent neural network. Gradient clipping ensures that the updates to the neural network’s weights are numerically stable.
Recurrent neural networks are especially prevalent in the area of natural language processing. The Word2vec algorithm can create word embeddings, or neural network representations of words. The basis of the Word2vec algorithm is that words can be represented well by measuring how frequently they co-occur in each other’s contexts. A common approach is to mask a term in a sentence in the training data and train a neural network to predict what word is concealed. Then, the embeddings are the learned hidden-layer representations of words in the network trained in this fashion.
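A minimal sketch of what using embeddings looks like once they are trained; the vocabulary and vectors here are random stand-ins, not learned values.

```python
import numpy as np

# A made-up vocabulary and embedding matrix; in practice each row is learned
# by training the network to predict masked or co-occurring words.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
embeddings = np.random.default_rng(0).standard_normal((len(vocab), 50))

def cosine_similarity(a, b):
    """Words that appear in similar contexts end up with similar vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings[vocab["king"]], embeddings[vocab["queen"]]))
```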
For natural language processing tasks, there are a few practical algorithms that help improve the predictions of the neural networks and evaluate them. Beam search can improve the final predictions of a neural network that generates sentences. At each step, the few most likely outputs ranked by the neural network are kept, and each is fed back into the network to score the words that could follow. Then, the candidates with the most credible products of previous and current word probabilities are chosen to produce the following words. This is an improvement over relying solely on the neural network to predict the most likely sentence greedily, where it would only give the single most likely next word without consideration of the words that follow. The beam width comes with a trade-off of speed versus accuracy, as larger widths take longer to compute.
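A minimal sketch of beam search; `next_word_probs` is a hypothetical stand-in for the trained network, returning a probability distribution over the vocabulary given a partial sentence.

```python
import numpy as np

def beam_search(next_word_probs, start_token, end_token, beam_width=3, max_len=20):
    """Keep the `beam_width` most probable partial sentences at every step."""
    beams = [([start_token], 0.0)]  # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished sentences are carried forward as-is
                candidates.append((seq, score))
                continue
            probs = next_word_probs(seq)      # feed the partial sentence back into the network
            for w in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(w)], score + np.log(probs[w])))
        # keep the most credible products of previous and current word probabilities
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```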