Neural Networks are powerful computational models known as universal approximators, a property described by the universal approximation theorem. This theorem asserts that a feedforward neural network with a single hidden layer of sigmoidal (S-shaped) activation functions, given enough hidden units, can approximate any measurable function to any desired level of accuracy on a compact set.

While a single hidden layer is theoretically sufficient, several important considerations and limitations arise. The theorem guarantees that suitable weights exist, but not that a learning algorithm will be able to find them, and the number of hidden units required may be exponentially large. Moreover, a model that does not generalize well is of little use in real-world applications, regardless of its theoretical capacity.

## Model Complexity

The complexity of a neural network model is typically measured by the number of parameters it possesses: the more parameters, the greater the model's flexibility to fit different datasets. To relate complexity to generalization, it is essential to consider the inductive learning hypothesis:

Hypothesis

if a solution approximates the target function well over a sufficiently large set of training examples, it is likely to generalize to unseen examples as well.

However, two scenarios can emerge in this context. In the first scenario, a model that is too simplistic relative to the data may underfit, failing to capture the underlying patterns. Conversely, if a model is overly complex and fits the training data too closely, it may overfit, becoming excessively tailored to the training set without performing well on new, unseen data.

The ideal model is one that demonstrates strong generalization capabilities, which means it can make accurate predictions on data it has not encountered during training. To evaluate a model’s generalization ability, it is critical to test it on an independent test set that was not part of the training process. Relying solely on training error can be misleading because this error is computed on the very data used for training, leading to overly optimistic performance estimates. New data will rarely replicate the training data exactly, and apparent patterns can be found even in purely random data. Thus, the most reliable way to assess a model’s generalization is an independent test set, ideally obtained from an external source. Furthermore, when working with classification problems, it is vital to maintain the class distribution by employing stratified sampling during data splits.
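As a concrete illustration of stratified sampling, the following minimal sketch splits a labelled dataset while preserving the class proportions in both parts (the function and variable names are illustrative, not taken from any library):

```python
import numpy as np

def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split (X, y) so that each class keeps roughly the same
    proportion in the training and test parts."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])   # shuffle within the class
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Example: 100 samples, 80% of class 0 and 20% of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)
train_idx, test_idx = stratified_split(X, y)
print(np.bincount(y[test_idx]))  # the 80/20 class ratio is preserved in the split
```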

To clarify the terminology commonly used in the context of neural network training, several definitions are essential:

Definitions

  • Training Dataset: The complete set of data available for training the model.
  • Training Set: The specific data used for learning the model’s parameters.
  • Test Set: The data reserved for final model evaluation.
  • Validation Set: The data employed for model selection and assessment during the training phase.
  • Training Data: Data used for fitting and selecting the model.
  • Validation Data: Data utilized to evaluate model quality during the selection and assessment phases.

During the development of a neural network, the model is trained exclusively on the training set. The test data must remain hidden from the developer until the final evaluation so that the evaluation stays unbiased. If performance needs to be monitored during training, the validation set is used instead.

## Cross-Validation

Definition

Cross-validation is a vital technique that allows the training dataset to be used not only for training the model but also for estimating its performance on unseen data.

When ample data is available, one can employ a Hold-Out set for validation. However, in scenarios where data is limited, more sophisticated techniques such as Leave-One-Out Cross-Validation (LOOCV) or K-fold Cross-Validation can be employed.

K-fold Cross-Validation is particularly popular because it offers a balanced trade-off between bias and variance. In this method, the training dataset is divided into $k$ distinct subsets (folds). The holdout procedure is then repeated $k$ times, with each fold serving as the held-out set once while the remaining $k-1$ folds form the training set. This process yields $k$ models and $k$ estimates of the generalization error. Typically, the average of these estimates is taken as the final assessment of the model’s performance, formulated mathematically as follows:

$$ \hat{E}_{\text{gen}} = \frac{1}{k} \sum_{i=1}^{k} E_i $$
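A minimal sketch of the procedure, assuming generic `train_fn` and `error_fn` callables supplied by the user (both are placeholders, not part of any specific library):

```python
import numpy as np

def k_fold_cv(X, y, k, train_fn, error_fn, seed=0):
    """Average the k held-out error estimates obtained by using
    each fold once as validation data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])             # fit on k-1 folds
        errors.append(error_fn(model, X[val_idx], y[val_idx]))   # score on the held-out fold
    return np.mean(errors)

# Toy usage: a "model" that predicts the training mean, scored with MSE
X = np.random.randn(50, 3)
y = np.random.randn(50)
cv_error = k_fold_cv(X, y, k=5,
                     train_fn=lambda X, y: y.mean(),
                     error_fn=lambda m, X, y: np.mean((y - m) ** 2))
print(cv_error)
```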

Warning

While K-fold Cross-Validation can be applied at various stages of model development, such as model assessment and hyperparameter tuning, it is crucial to consider the computational cost associated with training multiple models.

Careful planning and resource allocation are necessary to effectively implement these cross-validation techniques without overwhelming computational resources.

## Preventing Neural Networks Overfitting

Overfitting is a common problem in neural networks, where the model becomes too specialized to the training data and fails to generalize well to unseen data. Several techniques can help mitigate this issue, two of the most effective being early stopping and weight decay.

### Early Stopping

One of the simplest and most effective methods to prevent overfitting is early stopping. When training a neural network using Stochastic Gradient Descent (SGD), we typically observe a monotonically decreasing training error as the number of iterations increases. However, while the training error continues to decrease, the model may start to lose its ability to generalize to new data. This phenomenon occurs because the network is becoming overly specific to the training set, capturing noise rather than true patterns.

To prevent this, we hold out a portion of the data as a validation set.

Algorithm

The steps are as follows:

  1. Train the model on the training set.
  2. Use the validation set to monitor the model’s performance during training.
  3. Stop training when the validation error starts to increase, indicating that the model has begun to overfit.

Early stopping works by ensuring that the model stops learning before it starts memorizing the training data. This method is effective because it uses the validation error, rather than just the training error, to guide the training process. By stopping early, we prevent the network from fitting the noise in the data.
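A minimal sketch of this loop, assuming hypothetical `train_one_epoch` and `validation_error` callables and a patience criterion (a common practical variant, not the only possible stopping rule):

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs
    and return the best weights seen so far."""
    best_val, best_weights, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = train_one_epoch()          # one pass of SGD over the training set
        val_err = validation_error(weights)  # monitor performance on held-out data
        if val_err < best_val:
            best_val, best_weights = val_err, weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation error stopped improving
    return best_weights
```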

It is important to note that model selection occurs at different levels:

  • Parameter Level: Adjusting the weights, $w$, through the training process.
  • Hyperparameter Level: Deciding the architecture of the network, such as the number of layers or the number of hidden neurons in each layer, $J^{(l)}$.

### Weight Decay (L2 Regularization)

Weight decay, also known as L2 regularization, is another widely used technique for preventing overfitting. The core idea behind weight decay is to add a penalty term to the loss function, discouraging large weights. This method aims to limit the model’s capacity by constraining its “freedom,” thus reducing its likelihood of overfitting.

Instead of removing data from the training set, we can apply regularization techniques while utilizing all the data. Regularization imposes constraints based on prior assumptions about the model, which helps reduce overfitting.

The traditional approach of training a neural network involves maximizing the data likelihood, known as Maximum Likelihood Estimation (MLE):

$$ \hat{w}_{\text{MLE}} = \text{argmax}_w P(D|w) $$

However, in a Bayesian framework, we can reduce the model’s freedom by placing a prior distribution on the weights and maximizing the posterior distribution, known as Maximum A Posteriori Estimation (MAP):

$$ \hat{w}_{\text{MAP}} = \text{argmax}_w P(w|D) = \text{argmax}_w P(D|w)P(w) $$

In practice, small weights tend to improve the generalization of neural networks. This preference can be encoded by assuming that the weights follow a Gaussian prior with zero mean and variance $\sigma_w^2$, and then computing the MAP estimate.

With this prior, the objective function is modified and can be written as a minimization problem:

$$\begin{align}
\hat{w} &= \text{argmax}_w P(w|D) = \text{argmax}_w P(D|w)P(w) \\
&= \text{argmax}_w \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t_n - g(x_n | w))^2}{2\sigma^2}} \prod_{q=1}^Q \frac{1}{\sqrt{2\pi\sigma_w^2}} e^{-\frac{w_q^2}{2\sigma_w^2}} \\
&= \text{argmin}_w \sum_{n=1}^N \frac{(t_n - g(x_n | w))^2}{2\sigma^2} + \sum_{q=1}^Q \frac{w_q^2}{2\sigma_w^2} \\
&= \text{argmin}_w \underbrace{\sum_{n=1}^N (t_n - g(x_n | w))^2}_{\text{Fitting}} + \gamma \underbrace{\sum_{q=1}^Q w_q^2}_{\text{Regularization}}
\end{align}$$

In simpler terms, we are minimizing two components:

1. **Fitting term**: The error between the true target values $t_n$ and the predictions $g(x_n | w)$ made by the model.
2. **Regularization term**: The sum of the squared weights $w_q^2$, which discourages the model from using excessively large weights.

$$ \hat{w} = \text{argmin}_w \left[ \sum_{n=1}^N (t_n - g(x_n | w))^2 + \gamma \sum_{q=1}^Q w_q^2 \right] $$

where $\gamma$, the **regularization parameter**, controls the trade-off between fitting the data and keeping the weights small. When $\sigma_w \to 0$, the regularization term dominates, making $\gamma \to \infty$, and the model is heavily regularized. Conversely, if $\sigma_w \to \infty$, then $\gamma \to 0$, and the model is free to overfit.

> [!recall] Cross-Validation and Hyperparameter Tuning
>
> Cross-validation can be used to fine-tune the regularization parameter $\gamma$, ensuring that the model generalizes well:
>
> 1. Split the data into training and validation sets.
> 2. Train the model using different values of $\gamma$ and minimize the following expression for the training set:
>    $$ E_{\gamma}^{\text{TRAIN}} = \sum_{n=1}^{N_{\text{TRAIN}}} (t_n - g(x_n | w))^2 + \gamma \sum_{q=1}^Q w_q^2 $$
> 3. Evaluate each model on the validation set by calculating:
>    $$ E_{\gamma}^{\text{VAL}} = \sum_{n=1}^{N_{\text{VAL}}} (t_n - g(x_n | w))^2 $$
> 4. Choose the value of $\gamma$ that results in the lowest validation error.
> 5. Retrain the model using all the data (both training and validation sets) with the optimal value $\gamma^*$:
>    $$ E_{\gamma^*}^{\text{TRAIN}} = \sum_{n=1}^{N_{\text{TRAIN}} + N_{\text{VAL}}} (t_n - g(x_n | w))^2 + \gamma^* \sum_{q=1}^Q w_q^2 $$
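To make the objective concrete, here is an illustrative single gradient-descent step on the regularized loss for a linear model; this is a minimal sketch, not taken from any particular library:

```python
import numpy as np

def weight_decay_step(w, X, t, gamma, lr=0.001):
    """One gradient step on  sum_n (t_n - x_n . w)^2 + gamma * sum_q w_q^2 ."""
    residual = t - X @ w
    grad_fit = -2 * X.T @ residual   # gradient of the fitting term
    grad_reg = 2 * gamma * w         # gradient of the regularization term
    return w - lr * (grad_fit + grad_reg)

# Toy usage: a larger gamma pulls the weights towards zero
rng = np.random.default_rng(0)
X, t = rng.standard_normal((100, 5)), rng.standard_normal(100)
w = rng.standard_normal(5)
for _ in range(200):
    w = weight_decay_step(w, X, t, gamma=10.0)
print(np.round(w, 3))  # small weights, as encouraged by the penalty
```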
### Dropout: Limiting Overfitting by Stochastic Regularization

**Dropout** is an effective technique for reducing overfitting in neural networks by introducing stochastic regularization. This method works by **randomly turning off a subset of neurons during training**, which encourages the network to learn independent features rather than relying on co-adapted units that create dependencies between weights. Such dependencies can hinder the model's ability to generalize, as it may become overly reliant on specific combinations of neurons.

During training, each hidden unit in a layer is set to zero with a certain probability, denoted as $p_j^{(l)}$. For example, one might choose $p_j^{(l)} = 0.3$, indicating that 30% of the neurons will be dropped out for each mini-batch during training. This randomness **helps prevent the model from becoming too dependent on any individual neuron**, thus promoting more robust feature learning.

To implement dropout, a **mask vector** $m^{(l)}$ is created for each layer, where each element $m_j^{(l)}$ in the vector follows a Bernoulli distribution with keep probability $1 - p_j^{(l)}$. This can be expressed mathematically as:

$$ m^{(l)} = [m_1^{(l)}, m_2^{(l)}, \ldots, m_{J^{(l)}}^{(l)}] \quad \text{where } m_j^{(l)} \sim \text{Be}(1 - p_j^{(l)}), \; \forall j $$

The application of the dropout mask modifies the layer's output according to the rule:

$$ h^{(l)} = h^{(l)}(W^{(l)}h^{(l-1)} \bullet m^{(l)}) $$

where $W^{(l)}$ represents the weights of the layer, and $\bullet$ denotes the element-wise multiplication that applies the dropout mask.

Dropout essentially **trains a collection of weaker classifiers on different mini-batches of the data**. Each training iteration can be viewed as a different "thinned" version of the network. At test time, we leverage the entire network, effectively averaging the predictions across all the ensemble members. This averaging helps to improve generalization since the network is less likely to overfit to the noise present in the training data.

To maintain consistency between training and testing phases, it is important to **scale** the weights of the network during testing. The weights are scaled by $1 - p_j^{(l)}$, which effectively adjusts for the fact that during training, only a fraction of the neurons were active. This scaling ensures that the expected activation levels remain consistent between training and evaluation.

Dropout can be applied selectively to different parts of the neural network. For instance, it may be beneficial to apply dropout to only a small subset of hidden units, or to specific layers such as the input layer or output layer. This flexibility allows practitioners to tailor dropout to their specific models and datasets, optimizing the regularization effect based on the architecture and the complexity of the data at hand.
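A minimal forward-pass sketch of the masking and test-time scaling described above; the layer sizes, drop probability, and the choice of applying the mask to the post-activation output are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.3                      # probability of switching a unit off
W = rng.standard_normal((4, 8))   # weights of a layer with 8 inputs and 4 units
h_prev = rng.standard_normal(8)   # activations from the previous layer

# Training time: sample a Bernoulli mask (keep probability 1 - p_drop)
# and zero out the dropped units
mask = (rng.random(4) > p_drop).astype(float)
h_train = np.maximum(0, W @ h_prev) * mask

# Test time: keep every unit but scale by the keep probability (1 - p_drop)
h_test = np.maximum(0, W @ h_prev) * (1 - p_drop)
print(h_train, h_test, sep="\n")
```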
## Tips and Tricks in Neural Network Training

### Better Activation Functions

**Activation functions** play a crucial role in the training of neural networks, impacting the network's ability to learn complex representations. Traditional activation functions like Sigmoid and Tanh suffer from saturation issues, which can lead to vanishing gradients during backpropagation. When the gradients become very small (close to zero), the weight updates diminish, hindering the learning process, especially in deep networks. This phenomenon can be seen in the gradient computation:

$$ \frac{\partial }{\partial w_{ji}^{(1)}} E(w_{ji}^{(1)}) = -2 \sum_{n=1}^N \left( t_n - g_1(x_n, w) \right) g_1'(x_n, w) \, w_{1j}^{(2)} \, h_j' \left( \sum_{i=0}^{I} w_{ji}^{(1)} x_{i,n} \right) x_{i,n} $$

The vanishing gradient problem is particularly prominent in deep networks and Recurrent Neural Networks (RNNs), often resulting in stalled learning and ineffective training.

To address these challenges, the [Rectified Linear Unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) activation function was introduced, defined as:

$$ g(a) = \text{ReLU}(a) = \max(0, a) \quad \text{and} \quad g'(a) = \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases} $$

| Advantages of ReLU | Disadvantages of ReLU |
| --- | --- |
| **Faster SGD Convergence**:<br>ReLU significantly accelerates the convergence of stochastic gradient descent (SGD), achieving speeds up to six times faster compared to Sigmoid or Tanh. | **Non-Differentiability at Zero**:<br>While ReLU is non-differentiable at zero, it remains differentiable elsewhere, which generally does not impede training. |
| **Sparse Activation**:<br>Only a portion of hidden units are activated at any given time, promoting a more efficient representation. | **Non-zero Centered Output**:<br>Since ReLU outputs are always positive, this can introduce bias in the gradients. |
| **Efficient Gradient Propagation**:<br>ReLU effectively mitigates the vanishing and exploding gradient problems, facilitating better weight updates during training. | **Unbounded Output**:<br>The lack of upper bounds means that the activations can grow excessively large. |
| **Computational Efficiency**:<br>The ReLU function involves simple thresholding at zero, making it computationally efficient. | **Dying Neurons**:<br>A common issue with ReLU is the occurrence of "dying neurons," where certain neurons become inactive across all inputs. This happens when the neuron is pushed into a region where it outputs zero, preventing any gradients from propagating back, effectively rendering the neuron useless. |
| **Scale Invariance**:<br>The function maintains the property that $\max(0, ax) = a \max(0, x)$, allowing for consistent scaling. | |

To mitigate these issues, several variants of ReLU have been developed:

1. **[Leaky ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Leaky_ReLU)**: This variant addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs, defined as:
   $$ f(x) = \begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases} $$
2. **[Exponential Linear Unit (ELU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#ELU)**: ELUs aim to bring the mean activations closer to zero, which can accelerate learning. The formula is given by:
   $$ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{otherwise} \end{cases} $$
   where $\alpha$ is a hyperparameter that can be tuned.
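For reference, the three activations discussed above can be sketched in a few lines of code; the function names and the default `alpha` are illustrative choices:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    # small non-zero slope for negative inputs avoids "dying" units
    return np.where(a > 0, a, slope * a)

def elu(a, alpha=1.0):
    # pushes mean activations towards zero for negative inputs
    return np.where(a > 0, a, alpha * (np.exp(a) - 1.0))

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a), leaky_relu(a), elu(a), sep="\n")
```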
### Weights Initialization

The initialization of weights in neural networks is a critical step that significantly influences the effectiveness and efficiency of gradient descent during training. Proper weight initialization can prevent issues like vanishing or exploding gradients, which are particularly problematic in deep networks. Various strategies exist for initializing weights.

1. **Zero Initialization**: Initializing all weights to zero is a poor choice because it leads to <u>symmetrical gradients across the network</u>. Since each neuron will receive the same gradient, they will all learn the same features during training, which prevents the network from learning effectively.
2. **Large Value Initialization**: Starting with large weights can lead to very high gradients during backpropagation. If the weights are too large, the gradients can grow excessively as they propagate through the layers, resulting in numerical instability and making convergence very slow.
3. **Small Random Values**: Using small random values drawn from a normal distribution, such as $w \sim N(0, \sigma^2)$, is generally beneficial for small networks. However, this approach can be problematic for deeper networks, as weights that are too small can lead to vanishing gradients, where the gradients become so small that learning effectively ceases.

In deep networks, if the weights are initialized too small, the gradients shrink as they propagate through each layer. Conversely, if the weights are initialized too large, the gradients can grow excessively, leading to values that are too large to be useful. To address these issues, researchers have proposed specialized initialization techniques.

#### Xavier Initialization

Xavier initialization (also known as [Glorot initialization](https://en.wikipedia.org/wiki/Weight_initialization#Glorot_initialization)) provides a method to set weights based on the number of input and output neurons in a layer. The goal is to maintain the variance of activations across layers, ensuring that the output variance matches the input variance.

Suppose we have an input $x$ with $I$ components and a linear neuron with random weights $w$. The output of the neuron is given by:

$$ h_j = w_{j1} x_1 + \ldots + w_{jI} x_I $$

To derive the variance of the output, we can express the variance of the product $w_{ji} x_i$:

$$ \text{Var}(w_{ji} x_i) = E[x_i]^2 \text{Var}(w_{ji}) + E[w_{ji}]^2 \text{Var}(x_i) + \text{Var}(w_{ji}) \text{Var}(x_i) $$

Assuming both the inputs and weights have a mean of zero, the equation simplifies to:

$$ \text{Var}(w_{ji} x_i) = \text{Var}(w_{ji}) \text{Var}(x_i) $$

If we assume all $w_{ji}$ and $x_i$ are independent and identically distributed (i.i.d.), the variance of the output $h_j$ can be expressed as:

$$ \text{Var}(h_j) = I \cdot \text{Var}(w_{ji}) \cdot \text{Var}(x_i) $$

For the variance of the output to equal the variance of the input, we want:

$$ I \cdot \text{Var}(w_{ji}) = 1 $$

This leads to Xavier's recommendation to initialize the weights as follows:

$$ w \sim N\left(0, \frac{1}{n_{in}}\right) $$

where $n_{in}$ is the number of input connections to the layer.

#### He Initialization

Building on the foundations of Xavier initialization, [Kaiming He proposed a modification](https://en.wikipedia.org/wiki/Weight_initialization#He_initialization) specifically for layers that use Rectified Linear Units (ReLU) as activation functions. His approach accounts for the fact that ReLU activations output zero for half of their input range. Therefore, He initialization recommends:

$$ w \sim N\left(0, \frac{2}{n_{in}}\right) $$

This adjustment accounts for the non-linearities introduced by ReLU, aiming to maintain variance stability across the layers of the network.
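A minimal sketch of the two schemes for a fully connected layer; the helper names `xavier_init` and `he_init` and the layer sizes are illustrative:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Glorot/Xavier: keep the variance of activations roughly constant."""
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng):
    """He: doubled variance to compensate for ReLU zeroing half its inputs."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W1 = xavier_init(fan_in=256, fan_out=128, rng=rng)
W2 = he_init(fan_in=128, fan_out=64, rng=rng)
print(W1.std(), W2.std())  # roughly 1/sqrt(256) and sqrt(2/128)
```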
### Batch Normalization

**Batch Normalization** is a technique that improves training speed and stability by addressing the issue of *internal covariate shift*, where the distribution of activations changes as the network trains. This shift can slow down learning or cause instability, particularly in deep networks. By normalizing the activations within each mini-batch, [[5 - CNN for Localization and Weakly Supervised Localization|Batch Normalization]] helps maintain consistent distributions across layers.

Key features of Batch Normalization:

- It ensures that each layer's activations have a mean of zero and a variance of one at the start of training, which helps with gradient flow and reduces the sensitivity to initial weight values.
- It allows the use of higher learning rates, accelerating training and helping to avoid issues such as vanishing/exploding gradients.
- It can act as a form of regularization, slightly reducing the need for techniques like dropout.

```mermaid
graph LR
A[Fully connected] --> B[Batch Norm] --> C[ReLU]
```

[[5 - CNN for Localization and Weakly Supervised Localization|Batch Normalization]] is applied after the fully connected (or convolutional) layer and before the activation function (e.g., ReLU). The process includes normalizing the activations over a mini-batch, then scaling and shifting the normalized activations using learned parameters to preserve the expressiveness of the network.

For a mini-batch $B = \{x_1, \ldots, x_m\}$, with learned parameters $\gamma$ and $\beta$, the Batch Normalization process is as follows:

1. **Compute the mini-batch mean**:
   $$ \mu_B \leftarrow \frac{1}{m} \sum_{i=1}^m x_i $$
2. **Compute the mini-batch variance**:
   $$ \sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 $$
3. **Normalize the activations**:
   $$ \hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
   where $\epsilon$ is a small constant added for numerical stability.
4. **Scale and shift**:
   $$ y_i \leftarrow \gamma \hat{x}_i + \beta $$

The parameters $\gamma$ (scale) and $\beta$ (shift) are learned during training to allow the network to determine the optimal level of normalization.

> [!success] Benefits of Batch Normalization
>
> - **Improved Gradient Flow**: By normalizing activations, gradients propagate more effectively, preventing issues like vanishing/exploding gradients.
> - **Faster Training**: Networks can use higher learning rates, as Batch Normalization stabilizes training.
> - **Less Sensitivity to Initialization**: Since Batch Normalization maintains consistent activations across layers, it reduces the network’s dependence on careful weight initialization.
> - **Implicit Regularization**: It slightly reduces the need for other regularization techniques, like dropout, as it inherently introduces noise by using mini-batch statistics.
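The four steps above can be written out directly for a mini-batch of activations; in this sketch $\gamma$ and $\beta$ are fixed to 1 and 0 purely for illustration, whereas in a real network they are learned:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # 1. mini-batch mean
    var = x.var(axis=0)                     # 2. mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # 3. normalize
    return gamma * x_hat + beta             # 4. scale and shift

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))  # batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # approximately 0 and 1
```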
### Gradient Descent and Momentum

#### Nesterov Accelerated Gradient

Nesterov Accelerated Gradient (NAG) is an optimization technique that enhances the [standard momentum method](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) by incorporating a look-ahead mechanism. Unlike traditional momentum, where a "jump" is made based on the current gradient, NAG anticipates the next position, computes the gradient there, and then corrects the trajectory. The update steps are:

1. First, make a jump using the current momentum:
   $$ w_{k+\frac{1}{2}} = w_k - \alpha \left.\frac{\partial E(w)}{\partial w}\right|_{w_{k-1}} $$
2. Then, correct the position using the gradient at the look-ahead position:
   $$ w_{k+1} = w_k - \eta \left.\frac{\partial E(w)}{\partial w}\right|_{w_{k+\frac{1}{2}}} $$

This method can lead to faster convergence and more stable optimization.

### Adaptive Learning Rates

Neurons in different layers of a neural network learn at different rates, and adjusting the learning rates dynamically can improve performance. Several algorithms have been developed to provide adaptive learning rates:

1. **Resilient Propagation (Rprop)**: This algorithm adjusts each weight's step size based on the sign of the gradient. It aims to overcome the issues of vanishing gradients by focusing on the direction of the gradient rather than its magnitude.
2. **Adaptive Gradient (AdaGrad)**: AdaGrad adapts the learning rate for each parameter based on the magnitude of the gradients for that parameter. This makes it useful for dealing with sparse data, as it allows parameters associated with infrequent features to have larger updates.
3. **RMSprop**: RMSprop is a variant of AdaGrad that addresses its tendency to reduce the learning rate too much. It does this by using a moving average of the squared gradients to normalize the learning rate.
4. **AdaDelta**: This method is a refinement of RMSprop, focusing on ensuring that the learning rate remains adaptive without needing to set a default learning rate manually.
5. **Adam**: Adam combines the advantages of both AdaGrad and RMSprop. It maintains per-parameter learning rates that are adapted based on both the first moment (mean) and the second moment (variance) of the gradients. Adam is widely used due to its robust performance across a variety of tasks.
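As a closing illustration of the last point, a compact sketch of the Adam update for a single parameter vector; the hyperparameter values are the commonly quoted defaults and the example is not tied to any specific library:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes derived from running moments."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(np.round(w, 3))  # all coordinates driven towards zero
```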