Once the Neural Network is configured, it is ready to be trained to learn features from the data-set.


Figure 3.10: Training Neural Network Interface.


Stopping criteria


Figure 3.11: Stopping criteria



The stopping criteria contain the parameters that control when to stop the training procedure. Neural Network training stops when one of the following conditions occurs (a minimal sketch of this stopping logic follows the list):

    • Training goal is reached. The training goal is measured by the cost function value for the given Neural Network weights and training data inputs. If this value is less than the training goal specified in the stopping criteria (0.0001, for example), the training goal is reached.
    • Gradient goal is reached. If the gradient value for the given Neural Network weights and training data inputs is less than the gradient goal specified in the stopping criteria (0.01, for example), the gradient goal is reached.
    • Training epoch limit is reached. One epoch is a complete pass of the training data through the Neural Network: all training inputs are fed in and propagate through the Neural Network layers to calculate the outputs; these outputs are then compared with the training data targets to calculate the cost function and the errors, which are back-propagated through the layers to adjust the weights and optimize the cost function. After each pass, the epoch count increases by 1. When the epoch count exceeds the value specified in the stopping criteria (30, for example), the training epoch limit is reached.
    • Max Fails is reached. After an epoch in which none of the goals above is reached, the network is checked against the validation data-set; if the validation performance does not improve, the Max Fails counter increases by 1. Training stops when the Max Fails value exceeds the value specified in the stopping criteria settings (1, for example).
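
The interaction of these four criteria can be summarized as a simple training loop. The Python sketch below is illustrative only and is not ANNHUB's implementation; the callables cost_grad and val_cost and all numeric values are hypothetical placeholders.

    import numpy as np

    # Hypothetical stopping-criteria values, for illustration only.
    TRAINING_GOAL = 1e-4   # stop when the training cost falls below this
    GRADIENT_GOAL = 1e-2   # stop when the gradient magnitude falls below this
    MAX_EPOCHS    = 30     # stop after this many passes over the training data
    MAX_FAILS     = 1      # stop after this many validation failures

    def train(cost_grad, val_cost, w, lr=0.01):
        """Gradient-descent loop sketching the four stopping criteria.
        cost_grad(w) returns (training cost, gradient); val_cost(w) returns
        the validation cost. Both are assumed helpers, not ANNHUB functions."""
        fails, best_val = 0, np.inf
        for epoch in range(1, MAX_EPOCHS + 1):
            cost, grad = cost_grad(w)
            if cost < TRAINING_GOAL:
                return w, "training goal reached"
            if np.linalg.norm(grad) < GRADIENT_GOAL:
                return w, "gradient goal reached"
            val = val_cost(w)
            if val < best_val:
                best_val, fails = val, 0
            else:
                fails += 1                 # validation performance got worse
                if fails > MAX_FAILS:
                    return w, "max fails reached"
            w = w - lr * grad              # simple gradient-descent update
        return w, "training epoch limit reached"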


When the training data-set becomes too large, feeding all of the training data into the Neural Network at once causes performance issues due to limits on computer speed and internal memory. To avoid this issue, the training data-set is broken down into smaller sets whose size is defined by the Mini Batch Size setting. Only the first-order Gradient Descent training algorithm supports this feature.
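
Splitting a data-set into mini-batches can be sketched as follows. This is a generic illustration of the idea, not ANNHUB's internal code; the array layout (one sample per row) and the shuffling step are assumptions.

    import numpy as np

    def mini_batches(inputs, targets, batch_size):
        """Yield shuffled mini-batches of at most batch_size samples each."""
        order = np.random.permutation(len(inputs))
        for start in range(0, len(inputs), batch_size):
            rows = order[start:start + batch_size]
            yield inputs[rows], targets[rows]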


During the training process, the training data-set is used to adjust the Neural Network weights to optimize the cost function. The value of this cost function is called the performance index, or the performance on the training data-set. The validation set is used to prevent the over-fitting (over-training) issue. Over-fitting happens when the Neural Network is trained to work well on the training data-set but fails to predict outputs for new data. The technique that uses the validation data-set to prevent over-fitting is called the early stopping technique. With this technique, the stopping location is the point where the validation performance starts to go up, as shown in Figure 3.12.
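
In code, finding that stopping location amounts to locating the minimum of the recorded validation performance curve. A minimal sketch, assuming the per-epoch validation costs have been collected in a list:

    def stopping_location(val_costs):
        """Return the epoch index with the lowest validation cost, i.e. the
        point just before the validation curve starts to go up."""
        return min(range(len(val_costs)), key=val_costs.__getitem__)

    # Example: stopping_location([0.9, 0.5, 0.3, 0.35, 0.4]) returns 2.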



Figure 3.12: Early stopping technique to find stopping location.


When the Bayesian Regularization training algorithm is selected, early stopping is not applied because a validation set is not required. In Bayesian Regularization, the cost function contains a regularization term, also called a weight decay term. During training, this weight decay term tends to decrease the magnitude of the weights, which helps prevent over-fitting. As a result, the training performance curve looks like the one shown in Figure 3.13.
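
The regularized cost function typically has the form F = beta * E_D + alpha * E_W, where E_D is the sum of squared errors and E_W is the sum of squared weights (the weight decay term). A minimal sketch assuming this standard form; in practice Bayesian Regularization re-estimates alpha and beta automatically during training, so the values below are placeholders.

    import numpy as np

    def regularized_cost(errors, weights, alpha=0.01, beta=1.0):
        """F = beta * sum(errors^2) + alpha * sum(weights^2).
        The alpha term penalizes large weights, shrinking them during
        training and thereby discouraging over-fitting."""
        return beta * np.sum(np.square(errors)) + alpha * np.sum(np.square(weights))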



 

Figure 3.13: Performance curve when applying Bayesian Regularization training algorithm.


Training parameters


The training parameters are the parameters that a training algorithm needs in order to work and perform well. Depending on the selected training algorithm, the relevant parameter settings are set and displayed in the Training Neural Network step, as shown in Figure 3.14.


 

Figure 3.14: Different training parameters for given training algorithms.


The recommended training parameters are set based on the given data-set, but advanced users can also tweak them to achieve better training performance.


Optimal Neural Network Structure


A recommended Neural Network structure is configured based on the given data-set; however, it is not optimal. There is currently no way for designers to know exactly the right number of hidden layers and hidden nodes, so the "trial and error" method is widely used to determine a Neural Network structure.


In fact, a Neural Network with one hidden layer can approximate any function that contains a continuous mapping from one finite space to another. As a result, a one-hidden-layer Neural Network is sufficient for most applications. For image recognition, text-to-speech, and sound generation applications, it is better to use a deep learning technique such as a Convolutional Neural Network.


So how do we determine the number of hidden nodes? There are many rule-of-thumb methods for determining the correct number of hidden nodes, but they are all "trial and error" methods.


Some of these rules of thumb, applied in the sketch after the list, are:


      • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
      • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
      • The number of hidden neurons should be less than twice the size of the input layer.
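
The three rules can be turned into quick arithmetic. The following sketch is illustrative; the numbers it produces are only starting points for trial and error.

    def hidden_node_rules(n_inputs, n_outputs):
        """Apply the three rules of thumb above to suggest hidden-node counts."""
        return {
            "between input and output size": (min(n_inputs, n_outputs),
                                              max(n_inputs, n_outputs)),
            "2/3 of input size plus output size": round(2 * n_inputs / 3) + n_outputs,
            "less than twice input size": 2 * n_inputs - 1,
        }

    # Example: with 10 inputs and 2 outputs, the rules suggest between 2 and 10,
    # about 9, and fewer than 20 hidden nodes, respectively.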



When the Bayesian Regularization training algorithm is selected, the Evidence Framework, based on Bayes' theorem, is used to determine the evidence of the Neural Network model.

Based on the evidence ranking of different Neural Network structures, the optimal structure is the one that has the highest evidence value, as shown in Figure 3.15.


Figure 3.15: Optimal Neural Network Structure based on Evidence Framework


ANNHUB Professional Edition supports the Evidence Framework, which allows users to evaluate different Neural Network structures by specifying a range of hidden nodes (Min Hidden Nodes and Max Hidden Nodes). The Evidence Framework scans all structures in that range and generates an Evidence Plot that indicates what the optimal number of hidden nodes should be.
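
Conceptually, the scan evaluates the model evidence for each candidate structure and keeps the best one. A minimal sketch, where evidence_of is a hypothetical callable (not an ANNHUB function) that trains a one-hidden-layer network with n hidden nodes under Bayesian Regularization and returns its evidence:

    def scan_hidden_nodes(evidence_of, min_nodes, max_nodes):
        """Evaluate the evidence for each hidden-node count and pick the best."""
        evidences = {n: evidence_of(n) for n in range(min_nodes, max_nodes + 1)}
        best = max(evidences, key=evidences.get)
        return best, evidences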


Notes: Since the Evidence Framework is based on probability theory and the Neural Network weights are initialized randomly, the Evidence Plot might vary from run to run (each time the Get Evidence button is clicked).