# Learning wrapper

In this section a number of technical details have to be specified, such as the neural network architecture, the interpolation type for various variables etc.

## Model and neural network settings

The number of **hidden layers**, the number of units (**neurons**, nodes) in the hidden layer, the **transfer function** of the hidden layer,
the transfer function of the output layer and the **scaling type** of the neural network model inputs must be specified.

Note

No general recommendation on all these parameters can be given – proper choices strongly depend on the data set (number of observations, number of variables, correlation among model inputs etc.). Instead, some guidelines/rules of thumb are mentioned.

**Recommendations**:

- **Number of hidden layers**: in most cases, 1 or 2 layers of hidden neurons are sufficient.
- **Number of hidden units**: typically, a suitable number of hidden units is lower than, or at most in the same range as, the number of inputs. Note that each hidden layer will have this number of hidden units.
- **Hidden layer transfer function**: in most cases a nonlinear transfer function, such as the *sigmoid* (also referred to as *logistic*) or the *hyperbolic tangent*, gives the best results. The available transfer (or activation) functions are
  - **Linear** (Toolbox abbreviation: *Lin*)
  - **Sigmoid** (*Sigm*)
  - **Hyperbolic tangent** (*Tanh*)
  - **Rectified linear unit** (*ReLU*)
  - **SoftPlus**

Tip

Exact **mathematical descriptions** and visualizations of these (and many more) functions can be found in an excellent Wikipedia article on activation functions.

- **Output layer transfer function**: for prediction problems, where the outcome is a continuous variable, a *linear* output function is strongly recommended.
- **ANN input SNS**: scaling of the inputs of the neural network. It is generally advised to scale the inputs, e.g. by calculating their z-score (subtracting the mean and dividing by the standard deviation). In the hybrid modeling Toolbox the following scaling methods, applied to each input variable, are available:

| Tag | Method | Formula |
|---|---|---|
| None | N/A | \(x' = x\) |
| ZScore | Z-score normalization | \(x' = z = \frac{x - \bar{x}}{\sigma}\) |
| Max | Max scaling | \(x' = \frac{x}{\text{max}(x)}\) |
| AbsMax | Absolute Max scaling | \(x' = \frac{x}{\lvert \text{max}(x) \rvert}\) |
| MeanMinMax | Mean normalization | \(x' = \frac{x - \bar{x}}{\text{max}(x)-\text{min}(x)}\) |
| MinMax | Min-Max normalization | \(x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}\) |
| FeatureScalingTahn | Internal scaling | \(x' = -1 + \frac{2 \cdot (x - \text{min}(x))}{\text{max}(x) - \text{min}(x)}\) |
| FeatureScalingSigm | Internal scaling | \(x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}\) |
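As an illustration only, three of the scaling formulas listed above can be sketched in plain Python. The function names are hypothetical and not part of the Toolbox, and the use of the population standard deviation for the z-score is an assumption (the Toolbox may use the sample standard deviation instead).

```python
import statistics


def z_score(values):
    """ZScore: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (not confirmed by the Toolbox docs).
    sigma = statistics.pstdev(values)
    return [(v - mean) / sigma for v in values]


def max_scale(values):
    """Max: divide every value by the maximum value."""
    peak = max(values)
    return [v / peak for v in values]


def mean_min_max(values):
    """MeanMinMax: subtract the mean and divide by the range (max - min)."""
    mean = statistics.fmean(values)
    span = max(values) - min(values)
    return [(v - mean) / span for v in values]


data = [2.0, 4.0, 6.0, 8.0]
print(z_score(data))
print(max_scale(data))
print(mean_min_max(data))
```

Note that a constant input column (zero standard deviation or zero range) would cause a division by zero in this sketch; how the Toolbox treats that case is not stated here.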

Note

The scaling methods *FeatureScalingTahn*, *FeatureScalingSigm* and *MinMax* are special cases of a linear transformation of a
variable \(x\) to an arbitrary interval \([a, b]\):

\[x' = a + \frac{(x - \text{min}(x)) \cdot (b - a)}{\text{max}(x) - \text{min}(x)}\]

Therefore,

- *FeatureScalingTahn* is obtained by setting \(a = -1, b = 1\).
- *FeatureScalingSigm* and *MinMax* are obtained by setting \(a = 0, b = 1\).
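The two special cases can be checked with a short sketch of the linear map to \([a, b]\), assuming the usual form in which \(\text{min}(x)\) is sent to \(a\) and \(\text{max}(x)\) to \(b\). The function name `scale_to_interval` is hypothetical.

```python
def scale_to_interval(values, a, b):
    """Linearly map values so that min(values) -> a and max(values) -> b."""
    lo, hi = min(values), max(values)
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


data = [10.0, 15.0, 20.0]
tanh_scaled = scale_to_interval(data, -1.0, 1.0)  # corresponds to FeatureScalingTahn
sigm_scaled = scale_to_interval(data, 0.0, 1.0)   # corresponds to FeatureScalingSigm / MinMax
print(tanh_scaled)  # [-1.0, 0.0, 1.0]
print(sigm_scaled)  # [0.0, 0.5, 1.0]
```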

Note

The number of hidden layers, the number of neurons per layer and the number of iterations (see next chapter) determine the complexity of the neural network and therefore also the
computation time. Starting with **low initial values** for these parameters (and hence with a rather simple neural network) and **gradually increasing** the complexity is advised. If
no or hardly any change in the validation error can be determined, a proper solution is found. Finding a suitable model complexity is generally an iterative process.
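The iterative search described in this note could look roughly like the following sketch. It is not a Toolbox function: `validation_error` is a hypothetical callable that stands in for a complete training run, and the relative `tolerance` that defines "hardly any change" is an arbitrary assumption.

```python
def grow_until_plateau(candidates, validation_error, tolerance=0.01):
    """Return the last configuration that still improved the validation error.

    candidates       -- configurations ordered from simple to complex
    validation_error -- hypothetical callable: trains a model, returns its validation error
    tolerance        -- smallest relative improvement still counted as a change
    """
    best_config = candidates[0]
    best_error = validation_error(best_config)
    for config in candidates[1:]:
        error = validation_error(config)
        if best_error - error <= tolerance * best_error:
            break  # hardly any change: keep the simpler configuration
        best_config, best_error = config, error
    return best_config, best_error


# Stand-in for real training runs: (hidden layers, hidden units) -> validation error.
fake_errors = {(1, 2): 0.40, (1, 4): 0.25, (1, 8): 0.249, (2, 8): 0.30}
chosen, err = grow_until_plateau(list(fake_errors), fake_errors.get)
print(chosen, err)  # (1, 4) 0.25
```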

Note

Choosing **meaningful values** for the parameters of the neural network model architecture is of high importance and can substantially influence (increase or decrease) the
model performance of the final hybrid model.

## Optimizer stopping criteria

The maximum number of iterations (i.e. the number of updates of the neural network weights) strongly depends on the size of the data set (both rows and columns) and on the initial weights. It is suggested to start with a low number (\(<100\)) and to increase it if necessary.

## Data preprocessing

In the *data preprocessing* subwindow the **interpolation types** for various variables, the column number for the Time column,
the integrative time step \(dt\) as well as the false data value have to be set.

**Interpolation**: Because of its predictive nature, variables are approximated/interpolated to provide a prediction value at each *integrative step*. The Toolbox provides a number of available interpolation types:

- **Common**. Creates an interpolation based on arbitrary points.
- **Const**. The last available value is used.

  Example

  Before (original) and after (interpolated result):

  | Time | Value (before) | Value (after) |
  |---|---|---|
  | 1 | 5.5 | 5.5 |
  | 2 | 5.6 | 5.6 |
  | 3 | | 5.6 |
  | 4 | | 5.6 |
  | 5 | 7.0 | 7.0 |
  | 6 | 7.2 | 7.2 |
  | 7 | 8.4 | 8.4 |
  | 8 | | 8.4 |
  | 9 | 9.0 | 9.0 |
  | 10 | | 9.0 |

- **Cubic**. Creates a piecewise natural cubic spline interpolation based on arbitrary points, with zero second derivatives at the boundaries.
- **CubicRobust**. Creates a piecewise natural cubic spline interpolation based on arbitrary points, with zero second derivatives at the boundaries.
- **Hermite**. Creates a piecewise cubic Hermite spline interpolation based on arbitrary points and their slopes/first derivatives.
- **Linear**. Creates a piecewise linear interpolation based on arbitrary points.
- **Neville**. Creates a Neville polynomial interpolation from an unsorted set of \((x, y)\) value pairs.
- **Step**. Creates a step interpolation based on arbitrary points.
- **Sum**. A cumulative version of the *Const* method.

  Example

  Before (original) and after (interpolated result):

  | Time | Value (before) | Value (after) |
  |---|---|---|
  | 1 | 5.5 | 5.5 |
  | 2 | 5.6 | 11.1 |
  | 3 | | 11.1 |
  | 4 | | 11.1 |
  | 5 | 7.0 | 18.1 |
  | 6 | 7.2 | 25.3 |
  | 7 | 8.4 | 33.7 |
  | 8 | | 33.7 |
  | 9 | 9.0 | 42.7 |
  | 10 | | 42.7 |
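The *Const* and *Sum* examples above can be reproduced with a minimal sketch. The function names are hypothetical, missing values are represented by `None`, and the sketch only fills the gaps on the dataset's own time grid; it does not evaluate values at intermediate integrative steps as the Toolbox would.

```python
def const_fill(values):
    """Const: carry the last available value forward (None marks a missing value)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def sum_fill(values):
    """Sum: running total of the available values, held constant across gaps."""
    filled, total = [], 0.0
    for v in values:
        if v is not None:
            total += v
        filled.append(round(total, 10))  # rounding only hides floating-point noise
    return filled


before = [5.5, 5.6, None, None, 7.0, 7.2, 8.4, None, 9.0, None]
print(const_fill(before))
print(sum_fill(before))
```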

Note

Model variables can have different interpolation types assigned:

| Interpolation name | Min. No. Values ¹ | Variable type: Output ² (Sensitivity, Differential), Input ³, Volume ⁴ |
|---|---|---|
| Common | 1 | ✓ ✓ ✓ |
| Const | 0 | ✓ ✓ ✓ |
| Cubic | 2 | ✓ ✓ ✓ ✓ |
| CubicRobust | 2 | ✓ ✓ ✓ ✓ |
| Hermite | 2 | ✓ ✓ ✓ ✓ |
| Linear | 2 | ✓ ✓ ✓ ✓ |
| Neville | 1 | ✓ ✓ ✓ |
| Step | 1 | ✓ ✓ ✓ |
| Sum | 0 | ✓ ✓ ✓ |

**Annotations**

- ¹ Minimum number of valid values per run. Otherwise, interpolation cannot be carried out.
- ² Applies to selected Output variables. Their interpolation also depends on the selected training type.
- ³ Applies to selected Input variables.
- ⁴ Applies to selected (if any) Mass balance variables.

**Dataset Time Index**: The index of the dataset's Time variable **must** be indicated here.

Important

Please provide `Column No. - 1`. This means that if the \(Time\) variable is located in the first column of the dataset, provide the number 0; if it is located in the fifth column, provide 4, etc.

Note

By default, its value is set to the first column (hence 0).

**Integrative** \(dt\): The **units** of the Toolbox \(dt\) value match the units of the indicated Time column.

Example

If the units of the dataset's time column are *seconds*, the units of *dt* are also *seconds*.

It is recommended to set its value to at least \(0.5 \times\) the dataset time step.

Example

If the dataset time step is `2 hours`, provide \(1\) as the *dt* value.

The lower the integrative step \(dt\) is, the more precise the predicted value **might** be.

Note

Setting an integrative step value \(dt\) lower than the recommended one does not always result in better predictions.

The lower the integrative step value \(dt\) is, the more integrative steps must be performed resulting in higher computational costs.

**Recommendation**: use the suggested \(dt\) value as a first approach and – depending on the obtained results – decrease or increase the \(dt\) value to explore possible better configurations.
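The trade-off between \(dt\) and computational cost can be made concrete with a small sketch. The function names are hypothetical, and taking the smallest spacing between consecutive time stamps as "the dataset time step" is an assumption for irregularly sampled data.

```python
import math


def suggested_dt(times, factor=0.5):
    """Return factor x the smallest spacing between consecutive time stamps."""
    steps = [later - earlier for earlier, later in zip(times, times[1:])]
    return factor * min(steps)


def integrative_steps(times, dt):
    """Number of integrative steps needed to cover the whole time span."""
    return math.ceil((times[-1] - times[0]) / dt)


times = [0, 2, 4, 6, 8]            # time column in hours, time step 2 hours
dt = suggested_dt(times)           # 1.0, as in the example above
print(dt, integrative_steps(times, dt))      # 1.0 8
print(integrative_steps(times, dt / 4))      # 32: four times the cost
```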

**False data value**: If the dataset contains invalid values (such as \(-1000\) or \(-9.999\)) to code missing values, this invalid value must be specified.
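Conceptually, the false data value acts as a sentinel that is turned into "missing" before any further processing. The following sketch illustrates this idea only; `mask_false_values` is a hypothetical helper, not a Toolbox function.

```python
def mask_false_values(column, false_value, tolerance=1e-9):
    """Replace entries equal to the false data value with None (missing)."""
    return [None if abs(v - false_value) <= tolerance else v for v in column]


raw = [7.12, -9.999, -9.999, 7.56]
clean = mask_false_values(raw, -9.999)
print(clean)  # [7.12, None, None, 7.56]
```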

## Train options

**Learning Type**: choose between *Sensitivity* and *Differential* learning:

- **Sensitivity**. Performs a sensitivity analysis. Dynamic approach that simulates (predicts) the model behavior over time (using the integrative \(dt\)). The model-predicted (integrated) output values are compared to the true dataset values for error calculation and model tuning.
- **Differential**. Performs a differential analysis. Static approach that calculates the error directly (not integrated over time) from the model output values and the true values from the dataset.

Note

- **Sensitivity** learning takes more time to train than differential learning because, instead of evaluating the outputs for error calculation directly, the intermediate integrative steps (resulting from the specified integrative \(dt\)) have to be performed as well.
- If **differential analysis** is selected and `Diff. Train auto-transition` is checked, a sensitivity learning is started automatically once the differential learning has finished, using the result of the differential learning as its initial setup (instead of a random initialization).
- **Integration** is only carried out when sensitivity analysis is selected.

**Learning error evaluation**: Typically, Output variables are not high-frequency variables, i.e. they are rarely available. Consequently, many cells in the output variable columns of the dataset have missing values. When a value for an output variable is available, it is compared with the model-predicted value, and an error is calculated and used for further optimization. The error is calculated for each step where **at least one** output variable is available.

- **Interpolated**. If an output variable's value is not available, an interpolated value is used for error calculation.
- **Original**. The error calculation is based only on observations (time points) at which an output variable value is available.

Important

It is recommended to use the *Interpolated* type for error calculation.

Example

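Consider the following artificial dataset:

| VAR_Time | VAR_1 | VAR_2 | VAR_3 | VAR_4 | VAR_5 |
|---|---|---|---|---|---|
| 1 | 1.10 | 34.66 | 1.80 | 7.12 | 5.14 |
| 2 | 1.30 | 34.70 | 2.70 | | |
| 3 | 1.45 | 34.60 | 2.30 | | |
| 4 | 1.70 | 34.65 | 2.50 | 7.56 | |
| 5 | 1.78 | 34.30 | 2.98 | | 7.48 |
| 6 | 1.79 | 34.20 | 2.87 | | 8.12 |
| 7 | 2.00 | 34.25 | 2.47 | | 9.45 |
| 8 | 2.10 | 34.80 | 2.34 | | |
| 9 | 2.32 | 34.10 | 2.10 | 7.88 | |
| 10 | 2.40 | 34.55 | 2.60 | 7.94 | 1.30 |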

where

- Input variables are: `VAR_1`, `VAR_2`, `VAR_3`.
- Output variables are: `VAR_4`, `VAR_5`.

As described, the model error will be calculated at `ErrorCalcTimes = {1, 4, 5, 6, 7, 9, 10}`, as at least one of the output variables is available at these times.

If the **error evaluation type** is set to

- *Interpolated*, the model error will be calculated for `VAR_4` and `VAR_5` at each time in `ErrorCalcTimes`. If a value for an output variable is not available (e.g. at time \(t = 4\) for variable `VAR_5`), an interpolated value will be used.
- *Original*, the model error will be calculated for
  - `VAR_4` only at times `1, 4, 9, 10`.
  - `VAR_5` only at times `1, 5, 6, 7, 10`.
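The evaluation times of this example can be derived programmatically. The sketch below is only an illustration of the rule "at least one output variable is available"; the function names are hypothetical, and the observed output values are keyed by time according to the time lists given above.

```python
# Observed output values of the example dataset, keyed by time.
outputs = {
    "VAR_4": {1: 7.12, 4: 7.56, 9: 7.88, 10: 7.94},
    "VAR_5": {1: 5.14, 5: 7.48, 6: 8.12, 7: 9.45, 10: 1.30},
}


def error_calc_times(outputs):
    """Times at which at least one output variable has a value."""
    return sorted(set().union(*(obs.keys() for obs in outputs.values())))


def evaluation_times(outputs, mode):
    """Per-variable times at which an error term is calculated."""
    if mode == "Interpolated":
        times = error_calc_times(outputs)
        return {name: times for name in outputs}  # gaps are filled by interpolation
    if mode == "Original":
        return {name: sorted(obs) for name, obs in outputs.items()}
    raise ValueError(f"unknown mode: {mode}")


print(error_calc_times(outputs))              # [1, 4, 5, 6, 7, 9, 10]
print(evaluation_times(outputs, "Original"))
```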

On the one hand, when using *interpolated*, model tuning considers the errors of all outputs and backpropagates them consistently, at the cost of a potential interpolation error. On the other hand, when using *original*, no interpolation error is introduced, but backpropagation is not completely correct: the weights are only updated according to the available outputs, which can unbalance the behavior of the neural network for the outputs whose error was not calculated. Using *original* might also cause bigger problems if one output variable is available many more times than the others.

**Accelerate learning**: When this option is chosen, random **starts** are executed **in parallel**. This speeds up learning drastically. The number of parallel threads depends on the computer's hardware (number of cores, RAM, …).

**Number of best models**: A different model is generated for each iteration. If the setup consists of \(11\) boots, \(15\) starts, \(10\) steps and \(20\) iterations, the total number of generated models is

\[11\text{ boots} \cdot 15\frac{\text{starts}}{\text{boot}} \cdot 10\frac{\text{steps}}{\text{start}} \cdot 20\frac{\text{iterations}}{\text{step}} = 33{,}000 \text{ iterations} = 33{,}000 \text{ models}\]

It is not reasonable to keep all these models in memory, which is where the parameter `No. best models` comes into play. When this parameter is set, no matter how many models are generated **per start**, only the best ones (as many as `No. best models`) are kept in memory and the rest are disposed of automatically.

When evaluating the *best* models, their *Training* and *Validation* errors are calculated and as many models as `No. best models` are kept for each of these two error types. Therefore, a total of \(2 \cdot \text{No. best models}\) **per start** is kept in memory.

Note

When the report is created, \(2 \cdot \text{No. best models}\) are displayed for selection for each start. If the same model is among both the *Training* and the *Validation* best models, it is displayed only once. The available number of models for selection (for each start) is therefore at most

\[2\cdot \text{min}(\text{No. best models}, \text{No. Steps} \cdot \text{No. Iterations})\]
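The arithmetic above, together with the idea of keeping the best models per error type, can be sketched as follows. All names are hypothetical and the error values are made up for illustration; how the Toolbox ranks and stores models internally is not described here.

```python
import heapq


def total_models(boots, starts, steps, iterations):
    """One model per iteration, over all boots, starts and steps."""
    return boots * starts * steps * iterations


def max_selectable_per_start(n_best, steps, iterations):
    """Upper bound; a model that is best in both rankings is listed only once."""
    return 2 * min(n_best, steps * iterations)


def best_models(models, n_best):
    """models: (name, training_error, validation_error) tuples of one start."""
    by_train = heapq.nsmallest(n_best, models, key=lambda m: m[1])
    by_valid = heapq.nsmallest(n_best, models, key=lambda m: m[2])
    return {m[0] for m in by_train} | {m[0] for m in by_valid}


print(total_models(11, 15, 10, 20))          # 33000
print(max_selectable_per_start(5, 10, 20))   # 10

# Made-up errors: m3 is among the best two for both error types.
models = [("m1", 0.10, 0.30), ("m2", 0.20, 0.15), ("m3", 0.15, 0.20), ("m4", 0.40, 0.50)]
print(sorted(best_models(models, 2)))        # ['m1', 'm2', 'm3']
```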

**Number of Starts**: The number of times per boot that a new model is randomly initialized (i.e. started from random neural network weights). It is advised to use multiple random starts to avoid falling into a local (instead of a global) minimum of the loss function.

**Number of Steps**: The number of steps per random start.

Important

The terms *steps* and *iterations* might be confusing, but for the sake of convenience they can be considered the same. The total number of iterations per start is \(\text{No. Iterations (per start)} = \text{No. Steps} \cdot \text{No. Iterations}\). They are decoupled because after each *step* some mathematical actions are carried out to possibly improve performance.

Note

It is recommended to use `Steps = 1` and to directly choose as many iterations as wanted. If the model should stop after 240 iterations (per start) and the behind-the-scenes actions might be useful, a configuration like `Steps = 4` and `Iterations = 60` could be set.

**Number of Boots**: A **boot** is a certain **random split** of the available data into a *Training* and a *Validation* part according to the specified splitting ratio.

The maximum number of possible combinations to randomly draw \(N_\text{Train}\) runs from a total of \(N_\text{TrainValid}\) is given by

\[\binom{N_\text{TrainValid}}{N_\text{Train}} = \frac{N_\text{TrainValid}!}{N_\text{Train}!(N_\text{TrainValid} - N_\text{Train})!}\]

Note

If the specified number of boots is greater than the maximum number of possible combinations, the number of boots will automatically be limited to this maximum value.

Example

Consider a single dataset containing \(6\) different sets \(\text{TrainValid} = \begin{Bmatrix}1,2,3,4,5,6\end{Bmatrix}\) with \(R_\text{TrainValid} = 0.75\).

Applying the formulas given under **Train/Val Ratio**, we get

\[\begin{split}\begin{matrix} N_\text{TrainValid} = 6 \\ N_\text{Valid} = 1 \\ N_\text{Train} = 5 \end{matrix}\end{split}\]

With this configuration, the maximum number of combinations (boots) is therefore

\[\binom{6}{5} = \frac{6!}{5!(6-5)!} = 6\]

Even if the number of boots is set to more than \(6\), only \(6\) boots will be carried out, as this is the maximum number of possible combinations. Assume the number of boots is set to \(3\); the resulting boot combinations might look like

\[\begin{split}\text{Boot 1} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,3,4,5,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}2\end{Bmatrix} \end{matrix}\right.\end{split}\]

\[\begin{split}\text{Boot 2} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,2,3,4,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}5\end{Bmatrix} \end{matrix}\right.\end{split}\]

\[\begin{split}\text{Boot 3} \longrightarrow \left\{\begin{matrix} \text{Train} = \begin{Bmatrix}1,2,4,5,6\end{Bmatrix} \\ \text{Valid} = \begin{Bmatrix}3\end{Bmatrix} \end{matrix}\right.\end{split}\]
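The cap on the number of boots and the random drawing of splits can be illustrated with the Python standard library. This is a sketch under the assumption that boots are distinct combinations (which the cap described above suggests); `draw_boots` is a hypothetical helper, not the Toolbox's implementation.

```python
import itertools
import math
import random


def draw_boots(runs, n_train, n_boots, seed=None):
    """Draw up to n_boots distinct Training/Validation splits of the runs."""
    max_boots = math.comb(len(runs), n_train)
    n_boots = min(n_boots, max_boots)  # cap at the number of possible combinations
    rng = random.Random(seed)
    # Enumerating all combinations is fine for small run counts only.
    all_splits = list(itertools.combinations(runs, n_train))
    chosen = rng.sample(all_splits, n_boots)
    return [(set(train), set(runs) - set(train)) for train in chosen]


runs = [1, 2, 3, 4, 5, 6]
boots = draw_boots(runs, n_train=5, n_boots=3, seed=0)
for number, (train, valid) in enumerate(boots, start=1):
    print(f"Boot {number}: Train = {sorted(train)}, Valid = {sorted(valid)}")
```

Requesting more boots than possible, e.g. `draw_boots(runs, 5, 10)`, returns only 6 splits in this sketch.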

**Train/Val Ratio**: A split of the dataset into a *Training* and a *Validation* part is essential for model optimization. The Toolbox can generate many such splits automatically and randomly by a run/experiment variable.

If

- \(N_\text{TrainValid}\) is the total number of runs contained in all datasets selected for Training and Validation and
- \(R_\text{TrainValid}\) is the specified Training/Validation ratio,

the number of runs/experiments used for training (\(N_\text{Train}\)) and validation (\(N_\text{Valid}\)) during each boot is calculated as follows:

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - R_\text{TrainValid}) \cdot N_\text{TrainValid} \right \rfloor , 1) \\ N_\text{Train} = \text{max}(N_\text{TrainValid} - N_\text{Valid}, 1) \end{matrix}\end{split}\]

So there is at least one run in each of the training and validation sets. The higher the ratio \(R_\text{TrainValid}\) is chosen, the more runs/experiments will go to the training set and the fewer to the validation set.
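The split sizes can be computed with a few lines of Python. The sketch assumes that the floor applies to the product \((1 - R_\text{TrainValid}) \cdot N_\text{TrainValid}\), which is what the worked examples in this section imply; the function name and the small epsilon guarding against floating-point round-off are illustrative choices.

```python
import math


def split_counts(n_train_valid, ratio):
    """Return (N_Train, N_Valid) for a given number of runs and Train/Val ratio."""
    # The epsilon guards against products such as 1.9999999999999996.
    n_valid = max(math.floor((1 - ratio) * n_train_valid + 1e-9), 1)
    n_train = max(n_train_valid - n_valid, 1)
    return n_train, n_valid


print(split_counts(6, 0.75))  # (5, 1)
print(split_counts(6, 0.6))   # (4, 2)
print(split_counts(1, 0.75))  # (1, 1): the single run is used for both parts
print(split_counts(6, 0))     # (1, 6): one validation run is reused for training
```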

Note

In the following scenarios, validation is biased, as its datasets are also partially/completely used for training.

- If the *Train/Valid* datasets only contain **a single set**, then, no matter what the ratio is, the same set will be used for both training and validation. Therefore, their errors will be the same.
- If \(N_\text{TrainValid} \neq N_\text{Train} + N_\text{Valid}\), the datasets are being reused.
- If the ratio is set to `0`, all datasets will be used for validation, and one of those is also used for training.
- If the ratio is set to `1`, all datasets will be used for training, and one of those is also used for validation.

Note

It is recommended to use \(R_\text{TrainValid}\) values of \(> 0.5\) – the training set shall always be larger than the validation set.

Example 1

Consider a single dataset containing 6 different runs (\(N_\text{TrainValid} = 6\)) and a Training/Validation ratio of `0.75` (\(R_\text{TrainValid} = 0.75\)). Then

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - 0.75) \cdot 6 \right \rfloor, 1) = 1 \\ N_\text{Train} = \text{max}(6 - 1, 1) = 5 \end{matrix}\end{split}\]

Example 2

Consider a single dataset containing 6 different runs (\(N_\text{TrainValid} = 6\)) and a Training/Validation ratio of `0.6` (\(R_\text{TrainValid} = 0.6\)). Then

\[\begin{split}\begin{matrix} N_\text{Valid} = \text{max}(\left \lfloor (1 - 0.6) \cdot 6 \right \rfloor, 1) = 2 \\ N_\text{Train} = \text{max}(6 - 2, 1) = 4 \end{matrix}\end{split}\]

**Differential training auto-transition**: performs a sensitivity learning after a differential learning. This feature is only performed if *differential training* is selected.

**Perform clustering**: check this box if clustering shall be performed.

## Clustering options

- **Pdist**
- **Niter**
- **Ncluster**
- **trys**
- **Tau**
- **Initialize clustering with random centers**
- **ClusterUpdatePureKmeans**