Activation Functions
An activation function is the small mathematical curve applied at each neuron. After a neuron adds up its weighted inputs, the activation function reshapes that sum into the neuron's output. This non-linear reshaping is what gives a neural network its ability to model curves and interactions rather than just straight-line relationships — without it, stacking layers would be no more powerful than a single linear formula.
You choose the activation for each layer of an MLP with the
SetNetworkWithActivationLayer1…4 functions described on the
Network Architectures page. Each function takes the
hidden-layer activations as the second item of each (neurons, activation)
pair, and the output-layer activation as the final argument. If you use the plain
SetNetworkLayer… functions instead, every layer defaults to
Sigmoid.
The activation codes
The activation is selected by an integer code. The full set is below.
| Code | Activation — when to use |
|---|---|
| 0 | Sigmoid (logistic) — saturating, outputs 0..1. Good as an output activation for probabilities / classification (up vs down). |
| 1 | Tanh — saturating, outputs −1..1, zero-centred. A solid, reliable general-purpose hidden activation. |
| 2 | Symmetric Elliot — a cheaper, tanh-like sigmoid in the −1..1 range. |
| 3 | Symmetric Gaussian — a bell-shaped response in a symmetric range; specialised, for localised patterns. |
| 4 | Elliot — a cheaper, faster approximation of the sigmoid in the 0..1 range. |
| 5 | Gaussian — bell-shaped response; specialised. |
| 6 | Inverse Gaussian — an inverted bell; specialised. |
| 7 | ReLU — non-saturating; passes positives through, zeroes negatives. A strong modern choice for hidden layers; pair with He weight initialisation. |
| 8 | LeakyReLU — like ReLU but lets a small slope through on the negative side, avoiding "dead" neurons. A safe alternative to ReLU for hidden layers. |
| 9 | Linear (identity) — passes the value straight through with no limit. The recommended output activation for regression, where the prediction can be any value. |
Saturating vs. non-saturating
The activations fall into two broad groups, and the distinction matters in practice.
Saturating activations (sigmoid, tanh, the Elliot and Gaussian variants) squash their output into a fixed range and flatten out at the extremes. They are smooth and well-behaved and have been the classic choice for hidden layers. Their weakness is that very large inputs push them into their flat regions, where learning slows down.
Non-saturating activations (ReLU and LeakyReLU) do not flatten on the positive side, which keeps learning brisk in deeper networks. ReLU can occasionally leave neurons permanently stuck at zero ("dead units"); LeakyReLU avoids this by allowing a small negative slope. These pair naturally with He weight initialisation (see Accuracy & Avoiding Overfitting).
For a regression network that predicts an unbounded value (such as a future return), use a saturating or ReLU activation on the hidden layers and a Linear output (code 9). If the output stays a saturating function such as sigmoid, the prediction is trapped in 0..1 and can never reach the values you want. For classification (a 0..1 probability), use a Sigmoid output (code 0) together with the Tanh loss — see Accuracy & Avoiding Overfitting.
Shapes of the saturating activations
The images below show the shape of each saturating activation — the curve that maps the neuron's summed input (horizontal axis) to its output (vertical axis).
Activations are an MLP feature. The recurrent LSTM and GRU models use fixed internal gate activations and a linear output, so the codes above do not apply to them.