Neural networks are powerful computational models known as universal approximators, a property described by the universal approximation theorem. This theorem asserts that a feedforward neural network with a single hidden layer of sigmoidal (S-shaped) activation units can approximate any measurable function to any desired accuracy on a compact set.
While a single hidden layer can in theory approximate any such function, several important caveats arise. The theorem is an existence result: the number of hidden units required may be exponentially large, and it says nothing about whether a learning algorithm will actually find the necessary weights. Moreover, a model that fits the training data but does not generalize is ineffective in real-world applications, despite its theoretical capacity.
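To make the theorem concrete, here is a minimal NumPy sketch with hand-picked (not learned) weights: two sigmoidal hidden units form a "bump" that approximates the indicator function of an interval, and sums of such bumps can approximate more general functions. All names and constants below are illustrative choices, not part of the theorem itself.

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow warnings in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def one_hidden_layer(x, W1, b1, w2, b2):
    """Forward pass: sigmoidal hidden layer followed by a linear output."""
    h = sigmoid(np.outer(x, W1) + b1)  # (n, H) hidden activations
    return h @ w2 + b2

# Two hidden units form a "bump": sigmoid(k(x-0.3)) - sigmoid(k(x-0.7))
# is close to 1 on (0.3, 0.7) and close to 0 outside, for large k.
k = 200.0
W1 = np.array([k, k])
b1 = np.array([-0.3 * k, -0.7 * k])
w2 = np.array([1.0, -1.0])
b2 = 0.0

x = np.linspace(0.0, 1.0, 101)
y = one_hidden_layer(x, W1, b1, w2, b2)
```

Increasing the number of such bump-building hidden units refines the approximation, which is exactly where the potentially exponential growth in hidden units comes from.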
Model Complexity
The complexity of a neural network model is typically measured by the number of parameters it possesses; greater complexity lets the model fit a wider range of datasets. To reason about how complexity relates to performance on new data, it is essential to consider the inductive learning hypothesis:
Hypothesis
If a solution approximates the target function well over a sufficiently large set of training examples, it is likely to generalize to unseen examples as well.
However, two scenarios can emerge in this context. In the first scenario, a model that is too simplistic relative to the data may underfit, failing to capture the underlying patterns. Conversely, if a model is overly complex and fits the training data too closely, it may overfit, becoming excessively tailored to the training set without performing well on new, unseen data.
The ideal model is one that generalizes well: it makes accurate predictions on data it has not encountered during training. To evaluate generalization, it is critical to test the model on an independent test set that played no part in training. Relying on the training error alone is misleading, because it is computed on the very data used for fitting and therefore gives overly optimistic estimates; new data will not replicate the training data perfectly, and apparent patterns can be found even in random data. The most reliable assessment therefore comes from an independent test set, ideally obtained from an external source. Furthermore, in classification problems it is important to preserve the class distribution when splitting the data, which is achieved through stratified sampling.
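A stratified split can be sketched as follows; `stratified_split` is a hypothetical helper written here for illustration (in practice, libraries such as scikit-learn provide `train_test_split(..., stratify=y)`):

```python
import random
from collections import defaultdict

def stratified_split(y, test_frac=0.2, seed=0):
    """Split example indices so each class keeps (approximately) its
    original proportion in both the train and test parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label, idxs in by_class.items():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# toy labels: 8 examples of class 0, 4 of class 1 (a 2:1 ratio)
y = [0] * 8 + [1] * 4
train_idx, test_idx = stratified_split(y, test_frac=0.25)
```

With `test_frac=0.25`, the test set receives two class-0 examples and one class-1 example, preserving the 2:1 class ratio of the full dataset.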
To clarify the terminology commonly used in the context of neural network training, several definitions are essential:
Definitions
Training Dataset: The complete set of data available for training the model.
Training Set: The specific data used for learning the model’s parameters.
Test Set: The data reserved for final model evaluation.
Validation Set: The data employed for model selection and assessment during the training phase.
Training Data: Data used for fitting and selecting the model.
Validation Data: Data utilized to evaluate model quality during the selection and assessment phases.
During development, the model is trained exclusively on the training set, and the test data must remain hidden from the developer to ensure an unbiased final evaluation. If performance must be measured during training, the validation set is used instead.
Cross-Validation
Definition
Cross-validation is a vital technique that allows the training dataset to be used not only for training the model but also for estimating its performance on unseen data.
When ample data is available, one can employ a Hold-Out set for validation. However, in scenarios where data is limited, more sophisticated techniques such as Leave-One-Out Cross-Validation (LOOCV) or K-fold Cross-Validation can be employed.
K-fold Cross-Validation is particularly popular because it offers a balanced trade-off between bias and variance. In this method, the training dataset is divided into $K$ distinct subsets (folds). The holdout method is then applied $K$ times, with each fold serving as the validation set once while the remaining $K-1$ folds form the training set. This process yields $K$ models and $K$ estimates of the generalization error, and their average is taken as the final assessment of the model's performance:

$$\text{Err}_{CV} = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k$$
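The procedure can be sketched as follows; to keep the fold errors concrete, the "model" here is a deliberately trivial stand-in that just predicts the mean of its training folds:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Toy regression targets; the "model" predicts the training-fold mean.
y = np.arange(20, dtype=float)
errors = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    pred = y[train_idx].mean()                        # fit on K-1 folds
    errors.append(((y[val_idx] - pred) ** 2).mean())  # error on held-out fold
err_cv = float(np.mean(errors))                       # average of the K estimates
```

Setting `k = len(y)` recovers Leave-One-Out Cross-Validation as a special case.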
Warning
While K-fold Cross-Validation can be applied at various stages of model development, such as model assessment and hyperparameter tuning, it is crucial to consider the computational cost associated with training multiple models.
Careful planning and resource allocation are necessary to effectively implement these cross-validation techniques without overwhelming computational resources.
Preventing Overfitting in Neural Networks
Overfitting is a common problem in neural networks, where the model becomes too specialized to the training data and fails to generalize well to unseen data. Several techniques can help mitigate this issue, two of the most effective being early stopping and weight decay.
Early Stopping
One of the simplest and most effective methods to prevent overfitting is early stopping. When training a neural network with Stochastic Gradient Descent (SGD), the training error typically keeps decreasing as the number of iterations grows. However, while the training error continues to fall, the model may start to lose its ability to generalize to new data: the network becomes overly specific to the training set, capturing noise rather than true patterns.
To prevent this, we hold out a portion of the data as a validation set.
Algorithm
The steps are as follows:
Train the model on the training set.
Use the validation set to monitor the model’s performance during training.
Stop training when the validation error starts to increase, indicating that the model has begun to overfit.
Early stopping works by ensuring that the model stops learning before it starts memorizing the training data. This method is effective because it uses the validation error, rather than just the training error, to guide the training process. By stopping early, we prevent the network from fitting the noise in the data.
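The loop above can be sketched generically. Here `step_fn` and `val_error_fn` are hypothetical callbacks standing in for one epoch of SGD and a validation-set evaluation; the `patience` parameter, a common practical refinement not discussed above, tolerates a few non-improving epochs before stopping:

```python
def train_with_early_stopping(step_fn, val_error_fn, max_epochs=100, patience=3):
    """Run training epochs, stopping once the validation error has failed
    to improve for `patience` consecutive epochs. Returns the epoch and
    error of the best model seen."""
    best_err = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch in range(max_epochs):
        step_fn(epoch)              # one epoch of training
        err = val_error_fn()        # monitor on the validation set
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break               # validation error is rising: stop
    return best_epoch, best_err

# toy validation curve: decreases, then rises as overfitting sets in
curve = [1.0, 0.6, 0.4, 0.35, 0.37, 0.45, 0.6, 0.8]
state = {"epoch": -1}
best_epoch, best_err = train_with_early_stopping(
    step_fn=lambda e: state.update(epoch=e),
    val_error_fn=lambda: curve[state["epoch"]],
    max_epochs=len(curve),
    patience=2,
)
```

On this curve, training halts after two non-improving epochs and the minimum at epoch 3 is reported; in practice one would also keep a checkpoint of the weights from that best epoch.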
It is important to note that model selection occurs at different levels:
Parameter Level: Adjusting the weights $w$ through the training process.
Hyperparameter Level: Deciding the architecture of the network, such as the number of layers or the number of hidden neurons in each layer.
Weight Decay (L2 Regularization)
Weight decay, also known as L2 regularization, is another widely used technique for preventing overfitting. The core idea behind weight decay is to add a penalty term to the loss function, discouraging large weights. This method aims to limit the model’s capacity by constraining its “freedom,” thus reducing its likelihood of overfitting.
Instead of removing data from the training set, we can apply regularization techniques while utilizing all the data. Regularization imposes constraints based on prior assumptions about the model, which helps reduce overfitting.
The traditional approach to training a neural network is to maximize the likelihood of the data, known as Maximum Likelihood Estimation (MLE):

$$\hat{w}_{\text{MLE}} = \arg\max_{w} P(\mathcal{D} \mid w)$$
In a Bayesian framework, however, we can reduce the model's freedom by placing a prior distribution on the weights and maximizing the posterior distribution, known as Maximum A Posteriori (MAP) estimation:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} P(w \mid \mathcal{D}) = \arg\max_{w} P(\mathcal{D} \mid w)\, P(w)$$
In practice, small weights tend to improve the generalization of neural networks. Assuming the weights follow a Gaussian prior with zero mean and variance $\sigma^2$, the MAP estimate becomes:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \left[ \log P(\mathcal{D} \mid w) - \frac{1}{2\sigma^2} \|w\|^2 \right]$$
Equivalently, the weight-decay objective can be written as a minimization problem, with regularization strength $\lambda = 1/\sigma^2$:

$$\hat{w} = \arg\min_{w} \left[ -\log P(\mathcal{D} \mid w) + \frac{\lambda}{2} \|w\|^2 \right]$$
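A minimal sketch of weight decay in practice, using batch gradient descent on a least-squares loss as a stand-in for the negative log-likelihood; the penalty contributes $\lambda w$ to the gradient, shrinking the weights at every step. The data and hyperparameters below are illustrative:

```python
import numpy as np

def fit_with_weight_decay(X, y, lam, lr=0.01, epochs=2000):
    """Batch gradient descent on E(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2.
    The penalty term adds lam*w to the gradient, decaying the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) + lam * w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                    # noise-free targets for a clean comparison

w_plain = fit_with_weight_decay(X, y, lam=0.0)  # plain least squares
w_decay = fit_with_weight_decay(X, y, lam=5.0)  # with weight decay
```

With `lam=0` the solution recovers the true weights, while any `lam > 0` biases the estimate toward the origin, which is exactly the capacity-limiting effect described above.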