Learning wrapper¶

In this section a number of technical details have to be specified, such as the neural network architecture, the interpolation type for various variables etc.

Model and neural network settings¶

The number of hidden layers, the number of units (neurons, nodes) in the hidden layer, the transfer function of the hidden layer, the transfer function of the output layer and the scaling type of the neural network model inputs must be specified.

Note

No general recommendation on all these parameters can be given – proper choices strongly depend on the data set (number of observations, number of variables, correlation among model inputs etc.). Instead, some guidelines/rules of thumb are mentioned.

Recommendations:

Number of hidden layers: in many/most cases 1 or 2 layers of hidden neurons will be sufficient.
Number of hidden units: typically, a proper number of hidden units is lower than, or at most in the same range as, the number of inputs. Note that each hidden neural network layer will have this number of hidden units.
Hidden layer transfer function: in most cases, a nonlinear transfer function such as the sigmoid (also referred to as logistic) or the hyperbolic tangent gives the best results. Available transfer (or activation) functions are:
- Linear (Toolbox abbreviation: Lin)
- Sigmoid (Sigm)
- Hyperbolic tangent (Tanh)
- Rectified linear unit (ReLU)
- SoftPlus
Tip

Exact mathematical descriptions and visualizations on these (and many more) functions can be found in an excellent Wikipedia article on activation functions.
Output layer transfer function: for prediction problems, where the outcome is a continuous variable, a linear output function is strongly recommended.

ANN input SNS: scaling of the inputs of the neural network; it is generally advised to scale the inputs, e.g. by calculating their z-score (subtraction of the mean and division by the standard deviation). In the hybrid modeling Toolbox the following scaling methods, applied to each input variable, are available:

Table 8. ANN inputs scaling methods¶
Tag	Method	Formula
None	N/A	\[x' = x\]
ZScore	Z-score normalization	\[x' = z = \frac{x - \bar{x}}{\sigma}\]
Max	Max scaling	\[x' = \frac{x}{\text{max}(x)}\]
AbsMax	Absolute Max scaling	\[x' = \frac{x}{\left \| \text{max}(x) \right \|}\]
MeanMinMax	Mean normalization	\[x' = \frac{x - \bar{x}}{\text{max}(x)-\text{min}(x)}\]
MinMax	Min-Max normalization	\[x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}\]
FeatureScalingTahn	Internal scaling	\[x' = -1 + \frac{2 \cdot (x - \text{min}(x))}{\text{max}(x) - \text{min}(x)}\]
FeatureScalingSigm	Internal scaling	\[x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}\]

Note

The scaling methods FeatureScalingTahn, FeatureScalingSigm and MinMax are special cases of a linear transformation of a variable \(x\) to an arbitrary interval \([a, b]\):

\[x' = a + \frac{(x - \text{min}(x))(b-a)}{\text{max}(x)-\text{min}(x)}\]

Therefore,

FeatureScalingTahn is obtained by setting \(a = -1, b = 1\).
FeatureScalingSigm and MinMax are obtained by setting \(a = 0, b = 1\).
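
The relationship between the interval formula and the individual scaling tags can be checked numerically. The following is a minimal sketch, not the Toolbox implementation: the function names `scale_to_interval` and `z_score` are hypothetical, NumPy is assumed to be available, and the population standard deviation is assumed for the z-score because the text does not say which variant the Toolbox uses.

```python
import numpy as np


def scale_to_interval(x, a, b):
    """Linearly map x from [min(x), max(x)] to [a, b] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return a + (x - x_min) * (b - a) / (x_max - x_min)


def z_score(x):
    """Z-score normalization; population standard deviation is an assumption."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


# Illustrative input values only.
x = [1.10, 1.30, 1.45, 1.70, 1.78]

tanh_scaled = scale_to_interval(x, -1.0, 1.0)  # a = -1, b = 1
sigm_scaled = scale_to_interval(x, 0.0, 1.0)   # a = 0,  b = 1 (same as MinMax)
z = z_score(x)

print(tanh_scaled)
print(sigm_scaled)
print(z)
```

Both interval scalings differ only in the choice of \(a\) and \(b\), so the \([-1, 1]\) result equals twice the \([0, 1]\) result minus one.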

Note

The number of hidden layers, the number of neurons per layer and the number of iterations (see next chapter) determine the complexity of the neural network and therefore also the computation time. Starting with low initial values for these parameters (and hence with a rather simple neural network) and gradually increasing the complexity is advised. If no or hardly any change in the validation error can be determined, a proper solution is found. Finding a suitable model complexity is generally an iterative process.

Note

Choosing meaningful values for the parameters of the neural network model architecture is of high importance and can substantially influence (increase or decrease) the model performance of the final hybrid model.

Optimizer stopping criteria¶

The maximum number of iterations (which is the number of updates of the neural network weights) strongly depends on the size of the data set (both rows and columns) and on the initial starting weights. It is suggested to start with a low number (\(<100\)) and to increase it if necessary.

Data preprocessing¶

In the data preprocessing subwindow the interpolation types for various variables, the column number for the Time column, the integrative time step \(dt\) as well as the false data value have to be set.

Interpolation: Because of the predictive nature of the model, variables are approximated/interpolated to provide a value at each integrative step. The Toolbox provides the following interpolation types:

Common. Creates an interpolation based on arbitrary points.
Const. The last available value is used.

Example

Before (original) and after (interpolated result):

Time	Before	After
1	5.5	5.5
2	5.6	5.6
3		5.6
4		5.6
5	7.0	7.0
6	7.2	7.2
7	8.4	8.4
8		8.4
9	9.0	9.0
10		9.0
Cubic. Creates a piecewise natural cubic spline interpolation based on arbitrary points with zero second derivatives at the boundaries.
CubicRobust. Creates a piecewise natural cubic spline interpolation based on arbitrary points, with zero second derivatives at the boundaries.
Hermite. Creates a piecewise cubic Hermite spline interpolation based on arbitrary points and their slopes/first derivative.
Linear. Creates a piecewise linear interpolation based on arbitrary points.
Neville. Creates a Neville polynomial interpolation from an unsorted set of \((x, y)\) value pairs.
Step. Creates a step-interpolation based on arbitrary points.
Sum. A cumulative version of the Const method.

Example

Before (original) and after (interpolated result):

Time	Before	After
1	5.5	5.5
2	5.6	11.1
3		11.1
4		11.1
5	7.0	18.1
6	7.2	25.3
7	8.4	33.7
8		33.7
9	9.0	42.7
10		42.7
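
The Const and Sum examples can be reproduced with a few lines of code. This is a minimal sketch of the behavior shown in the two example tables, not the Toolbox's own code: the function names `const_fill` and `sum_fill` are hypothetical, missing values are represented by `None`, and returning NaN before the first available value is an assumption.

```python
import math


def const_fill(values):
    """Const: carry the last available value forward (NaN before the first one)."""
    out, last = [], math.nan
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def sum_fill(values):
    """Sum: cumulative version of Const; missing values add nothing."""
    out, total = [], 0.0
    for v in values:
        if v is not None:
            total += v
        out.append(total)
    return out


# "Before" column of the example tables; None marks a missing value.
before = [5.5, 5.6, None, None, 7.0, 7.2, 8.4, None, 9.0, None]

const_after = const_fill(before)
sum_after = [round(v, 1) for v in sum_fill(before)]  # rounded for display

print(const_after)
print(sum_after)
```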

Note

Model variables can have different interpolation types assigned:

Interpolation	Min. No. Values [1]	Output [2] (Sensitivity)	Output [2] (Differential)	Input [3]	Volume [4]
Common	1	✓		✓	✓
Const	0	✓		✓	✓
Cubic	2	✓	✓	✓	✓
CubicRobust	2	✓	✓	✓	✓
Hermite	2	✓	✓	✓	✓
Linear	2	✓	✓	✓	✓
Neville	1	✓		✓	✓
Step	1	✓		✓	✓
Sum	0	✓		✓	✓

Annotations

Dataset Time Index: The dataset’s Time variable index must be indicated here.

Important

Please provide Column No. - 1. This means, if the \(Time\) variable is located in the first column of the dataset, provide the number 0; if the time variable is located in the fifth column, provide 4 etc.

Note

By default, its value is set to the first column (hence 0).

Integrative \(dt\): The units of the Toolbox \(dt\) value match the units of the indicated Time column.

Example

If the dataset's time column is given in seconds, the indicated dt is also interpreted in seconds.

It is recommended to set \(dt\) to at least \(0.5 \times\) the dataset time step.

Example

If the dataset time step is 2 hours, provide \(1\) as dt value.

The lower the integrative step \(dt\) is, the more precise the predicted value might be.
Note
- Setting a lower integrative step value \(dt\) than recommended does not always result in better predictions.
- The lower the integrative step value \(dt\) is, the more integrative steps must be performed resulting in higher computational costs.
- Recommendation: use the suggested \(dt\) value as a first approach and – depending on the obtained results – decrease or increase the \(dt\) value to explore possible better configurations.
False data value: If the dataset contains invalid values (such as \(-1000\) or \(-9.999\)) to code missing values, this invalid value must be specified.
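
The effect of \(dt\) on the computational cost, and the role of the false data value, can be illustrated with a small sketch. Everything here is illustrative and not part of the Toolbox: the helper names, the sentinel `-9.999`, the use of the smallest time step for irregular data, and the way steps are counted are all assumptions.

```python
import math

FALSE_DATA_VALUE = -9.999  # hypothetical sentinel that codes a missing value


def mask_false_values(column, false_value=FALSE_DATA_VALUE):
    """Replace the sentinel by NaN so it is never treated as a measurement."""
    return [math.nan if v == false_value else v for v in column]


def suggested_dt(time_column):
    """0.5 x the dataset time step (smallest step used if steps differ)."""
    steps = [t1 - t0 for t0, t1 in zip(time_column, time_column[1:])]
    return 0.5 * min(steps)


def n_integration_steps(time_column, dt):
    """Number of integrative steps needed to cover the whole time range."""
    return math.ceil((time_column[-1] - time_column[0]) / dt)


time_h = [0, 2, 4, 6, 8]  # time column in hours, 2 h time step
dt = suggested_dt(time_h)

print(dt, n_integration_steps(time_h, dt))
print(n_integration_steps(time_h, dt / 2))  # halving dt doubles the step count
print(mask_false_values([1.2, -9.999, 3.4]))
```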

Train options¶

Learning Type: choose between Sensitivity and Differential learning:
- Sensitivity. Performs a sensitivity analysis. Dynamic approach that simulates (predicts) the model behavior over time (integrative dt). The model-predicted (integrated) output values are compared to the true dataset values for error calculation and model tuning.
- Differential. Performs a differential analysis. Static approach by calculating the error directly (not integrated over time) using the model output values and the true values from the dataset.
Note
- Sensitivity learning takes more time to train than differential learning: instead of evaluating the outputs directly for error calculation, it additionally performs all intermediate integrative steps resulting from the specified integrative dt.
  - If differential analysis is selected and Diff. Train auto-transition is checked, a sensitivity learning is started automatically once the differential learning has finished, using the differential result as its initial setup (instead of a random initialization).
  - Integration is only carried out when sensitivity analysis is selected.

Learning error evaluation: Typically, Output variables are not high frequency variables, i.e. they are rarely available. Consequently many cells in the output variables columns in the dataset have missing values. When a value for an output variable is available, its value is compared with the model predicted value and an error is calculated and used for further optimization. The error will be calculated for each step where at least one output variable is available.

Interpolated. If an output variable’s value is not available, an interpolated value is used for error calculation.
Original. The error calculation is based only on observations (time points), when an output variable value is available.

Important

It is recommended to use the Interpolated type for error calculation.

Example

Consider the following artificial dataset

Table 9. Example dataset with missing values in the two output variables.¶
VAR_Time	VAR_1	VAR_2	VAR_3	VAR_4	VAR_5
1	1.10	34.66	1.80	7.12	5.14
2	1.30	34.70	2.70
3	1.45	34.60	2.30
4	1.70	34.65	2.50	7.56
5	1.78	34.30	2.98		7.48
6	1.79	34.20	2.87		8.12
7	2.00	34.25	2.47		9.45
8	2.10	34.80	2.34
9	2.32	34.10	2.10	7.88
10	2.40	34.55	2.60	7.94	1.30

where

Input variables are: VAR_1, VAR_2, VAR_3.
Output variables are: VAR_4, VAR_5.

As described, the model error will be calculated at ErrorCalcTimes = {1, 4, 5, 6, 7, 9, 10}, as at least one of the output variables is available at these times.

If the error evaluation type is set to

Interpolated, the model error will be calculated for VAR_4, VAR_5 at each time in ErrorCalcTimes. If a value for an output variable is not available (e.g. at time \(t = 4\) for variable VAR_5), an interpolated value will be used.
Original, the model error will be calculated for
- VAR_4 only at times 1, 4, 9, 10.
- VAR_5 only at times 1, 5, 6, 7, 10.

With Interpolated, model tuning considers the errors of all outputs and backpropagates them consistently, at the cost of a potential interpolation error. With Original, no interpolation error is introduced, but backpropagation is not fully consistent: the weights are updated only according to the outputs that are available, which unbalances the behavior of the neural network for the outputs whose error is not calculated. Original may also run into bigger problems if one output variable is available far more often than the others.
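
The time points used by the two error evaluation types can be derived directly from the example dataset. The sketch below only reproduces the bookkeeping described in the example; the helper names are hypothetical and missing cells are represented by NaN.

```python
import math

NAN = math.nan
times = list(range(1, 11))

# Output columns of the example dataset; NAN marks an empty cell.
var_4 = [7.12, NAN, NAN, 7.56, NAN, NAN, NAN, NAN, 7.88, 7.94]
var_5 = [5.14, NAN, NAN, NAN, 7.48, 8.12, 9.45, NAN, NAN, 1.30]


def available_times(times, column):
    """Times at which this output has a measured value (Original)."""
    return [t for t, v in zip(times, column) if not math.isnan(v)]


def error_calc_times(times, *columns):
    """Times at which at least one output is available (ErrorCalcTimes)."""
    return [
        t for i, t in enumerate(times)
        if any(not math.isnan(col[i]) for col in columns)
    ]


calc_times = error_calc_times(times, var_4, var_5)
original_4 = available_times(times, var_4)
original_5 = available_times(times, var_5)

print("Interpolated, both outputs:", calc_times)
print("Original, VAR_4:", original_4)
print("Original, VAR_5:", original_5)
```

With Interpolated, both outputs contribute an error at every time in `calc_times`; with Original, each output contributes only at its own available times.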

Accelerate learning: When this option is chosen, random starts are executed in parallel, which speeds up learning drastically. The number of parallel threads depends on the computer's hardware (number of cores, RAM, …).

Number of best models: A different model is generated for each iteration. If the setup consists of \(11\) boots, \(15\) starts, \(10\) steps and \(20\) iterations, the total number of generated models is

\[11\text{ boots} \cdot 15\frac{\text{starts}}{\text{boot}} \cdot 10\frac{\text{steps}}{\text{start}} \cdot 20\frac{\text{iterations}}{\text{step}} = 33\,000 \text{ iterations} = 33\,000 \text{ models}\]

It is not reasonable to keep all these models in memory; this is where the parameter No. best models comes into play. When this parameter is set, no matter how many models are generated per start, only the best ones (as many as No. best models) are kept in memory and the rest are disposed of automatically.

When evaluating the best models, their Training and Validation errors are calculated and as many models as No. best models are kept for each of these two error types. Therefore, a total of \(2 \cdot \text{No. best models}\) models per start is kept in memory.

Note

When the report is created, \(2 \cdot \text{No. best models}\) models are displayed for selection for each start. If the same model is among both the Training and the Validation best models, it is displayed only once. Therefore, the number of models available for selection (for each start) is at most

\[2\cdot \text{min}(\text{No. best models}, \text{No. Steps} \cdot \text{No. Iterations})\]
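
The arithmetic behind the number of generated and selectable models can be written out as a short sketch. The function names are hypothetical; the formulas are the ones stated above.

```python
def total_models(boots, starts, steps, iterations):
    """One model per iteration, over all boots, starts and steps."""
    return boots * starts * steps * iterations


def max_selectable_per_start(n_best, steps, iterations):
    """Upper bound on models offered for selection per start.

    Training-best and Validation-best lists each hold at most n_best models,
    and never more than the number of models generated in one start.
    """
    return 2 * min(n_best, steps * iterations)


print(total_models(boots=11, starts=15, steps=10, iterations=20))
print(max_selectable_per_start(n_best=5, steps=10, iterations=20))
print(max_selectable_per_start(n_best=500, steps=10, iterations=20))
```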

Number of Starts: The number of times per boot that a new model is randomly initialized (i.e. started from random neural network weights). It is advised to use multiple random starts to avoid falling into a local (instead of a global) minimum of the loss function.

Number of Steps: The number of steps per random start.

Important

The terms Steps and iterations might be confusing, but – for the sake of convenience – they can be considered the same. The total number of iterations per start is \(\text{No. Iterations (per start)} = \text{No. Steps} \cdot \text{No. Iterations}\). The reason why they are decoupled, is because after each step some mathematical actions are carried out to possibly improve performance.

Note

It is recommended to use Steps = 1 and to choose directly as many iterations as wanted. If the model should stop after 240 iterations (per start) and the behind-the-scenes actions performed after each step might be useful, a configuration like Steps = 4 and Iterations = 60 could be set.

Number of Boots: A boot is a certain random split of the available data into a Training and a Validation part according to the specified splitting ratio.

The maximum number of possible combinations to randomly draw \(N_\text{Train}\) runs from in total \(N_\text{TrainValid}\) is given by

\[\binom{N_\text{TrainValid}}{N_\text{Train}} = \frac{N_\text{TrainValid}!}{N_\text{Train}!(N_\text{TrainValid} - N_\text{Train})!}\]

Note

If the specified number of boots is greater than the maximum number of possible combinations, the number of boots will automatically be limited to this maximum value.

Example

Consider a single dataset containing \(6\) different sets \(\text{TrainValid} = \begin{Bmatrix}1,2,3,4,5,6\end{Bmatrix}\) with \(R_\text{TrainValid} = 0.75\). Applying the Train/Val Ratio formulas to this configuration gives

\[\begin{split}\begin{matrix} N_\text{TrainValid} = 6 \\ N_\text{Valid} = 1 \\ N_\text{Train} = 5 \end{matrix}\end{split}\]

With this configuration the maximum number of combinations (boots) is therefore

\[\binom{6}{5} = \frac{6!}{5!(6-5)!} = 6\]

Even if the number of boots is set to more than \(6\), only \(6\) boots will be carried out as this is the maximum number of possible combinations. Assume the number of boots is set to \(3\) – the resulting boots combinations might look like

\[\begin{split}\text{Boot 1} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,3,4,5,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}2\end{Bmatrix} \end{matrix}\right.\end{split}\]

\[\begin{split}\text{Boot 2} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,2,3,4,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}5\end{Bmatrix} \end{matrix}\right.\end{split}\]

\[\begin{split}\text{Boot 3} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,2,4,5,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}3\end{Bmatrix} \end{matrix}\right.\end{split}\]
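
The cap on the number of boots follows from the binomial coefficient and can be sketched as follows. How the Toolbox actually draws its random splits is not documented here, so `draw_boots` (enumerate all combinations, then sample without repetition) is only one possible, assumed approach; both function names are hypothetical.

```python
import math
import random
from itertools import combinations


def effective_boots(n_train_valid, n_train, requested):
    """Requested boots, limited to the number of distinct train/valid splits."""
    return min(requested, math.comb(n_train_valid, n_train))


def draw_boots(runs, n_train, requested, seed=0):
    """Draw distinct random splits (assumed strategy, fine for small run counts)."""
    all_splits = list(combinations(runs, n_train))
    rng = random.Random(seed)
    chosen = rng.sample(all_splits, min(requested, len(all_splits)))
    return [(list(train), [r for r in runs if r not in train]) for train in chosen]


runs = [1, 2, 3, 4, 5, 6]
print(effective_boots(6, 5, requested=3))   # fewer than the maximum
print(effective_boots(6, 5, requested=10))  # capped at the maximum
for train, valid in draw_boots(runs, n_train=5, requested=3):
    print("Train =", train, "Valid =", valid)
```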

Train/Val Ratio: A split of the dataset into a Training and a Validation part is essential for model optimization. The Toolbox can generate many such splits automatically and randomly, based on a run/experiment variable.

If,
- \(N_\text{TrainValid}\) is the total number of runs contained along all datasets selected for Training and Validation and
- \(R _\text{TrainValid}\) is the specified Training/Validation ratio,
the number of runs/experiments used for training (\(N_\text{Train}\)) and validation (\(N_\text{Valid}\)) during each boot are calculated as follows:

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - R_\text{TrainValid}) \cdot N_\text{TrainValid} \right \rfloor , 1) \\ N_\text{Train} = \text{max}(N_\text{TrainValid} - N_\text{Valid}, 1) \end{matrix}\end{split}\]

So there is at least one run in each of the training and validation sets. The higher the ratio \(R_\text{TrainValid}\) is chosen, the more runs/experiments go to the training set and the fewer to the validation set.
Note

In the following scenarios, validation is biased, as its datasets are also being partially/completely used for training.
- If the Train/Valid datasets only contain a single set, then – no matter what the ratio is – the same set will be used indistinctly for training and validation. Therefore, its errors will be the same.
- If \(N_\text{TrainValid} \neq N_\text{Train} + N_\text{Valid}\), the datasets are being reused.
- If the ratio is set to 0, all datasets will be used for validation, and one of those is also used for training.
- If the ratio is set to 1, all datasets will be used for training, and one of those is also used for validation.
Note

It is recommended to use \(R_\text{TrainValid}\) values of \(> 0.5\) – the training set shall always be larger than the validation set.

Example 1

Consider a single dataset containing 6 different runs (\(N_\text{TrainValid} = 6\)) and a Training/Validation ratio of 0.75 (\(R_\text{TrainValid} = 0.75\)). Then,

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - 0.75) \cdot 6 \right \rfloor, 1) = 1 \\ N_\text{Train} = \text{max}(6 - 1, 1) = 5 \end{matrix}\end{split}\]

Example 2

Consider a single dataset containing 6 different runs (\(N_\text{TrainValid} = 6\)) and a Training/Validation ratio of 0.6 (\(R_\text{TrainValid} = 0.6\)). Then,

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - 0.6) \cdot 6 \right \rfloor, 1) = 2 \\ N_\text{Train} = \text{max}(6 - 2, 1) = 4 \end{matrix}\end{split}\]
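
The split sizes of both examples can be reproduced with a small sketch. It assumes that the floor is applied to the product \((1 - R_\text{TrainValid}) \cdot N_\text{TrainValid}\), which is the reading that yields the results of Example 1 and Example 2. The function name `split_sizes` and the small epsilon that guards against floating-point rounding are additions for illustration, not Toolbox behavior.

```python
import math


def split_sizes(n_train_valid, ratio):
    """Return (N_Train, N_Valid) for a given Training/Validation ratio."""
    # Epsilon (an assumption) avoids flooring values like 1.9999999999999996 to 1.
    n_valid = max(math.floor((1.0 - ratio) * n_train_valid + 1e-9), 1)
    n_train = max(n_train_valid - n_valid, 1)
    return n_train, n_valid


print(split_sizes(6, 0.75))  # Example 1
print(split_sizes(6, 0.6))   # Example 2
print(split_sizes(1, 0.75))  # single run: reused for training and validation
```

When the two returned sizes add up to more than the number of runs, as in the single-run case, at least one run is used for both training and validation and the validation error is biased.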
Differential training auto-transition: performs a sensitivity learning after a differential learning. This feature will only be performed if differential training is selected.
Perform clustering: check this box, if clustering shall be performed.

Clustering options¶

Pdist
Niter
Ncluster
trys
Tau
Initialize clustering with random centers
ClusterUpdatePureKmeans
