WiseTrader Toolbox

Training Algorithms

The training algorithm (the optimizer) is the method the network uses to adjust its weights so its predictions get closer to your target. The toolbox offers 14 of them, from the classic resilient-propagation family to the modern Adam family. You choose one with SetLearningAlgorithm.

Tip

Honest perspective first: on noisy market data the optimizer is rarely what decides your results — overfitting is. Pick a sensible algorithm from this page, then spend your effort on the anti-overfitting settings, which make a much bigger difference.

Choosing the optimizer: SetLearningAlgorithm

SetLearningAlgorithm(Code)
Code  Algorithm: family and when to use
0     Backpropagation: SGD + momentum. The classic method; uses the learning rate and momentum. Usually superseded by the options below.
1     Self-Adaptive (SSAB): per-weight adaptive rates.
2     RPROP: resilient propagation. Ignores the learning rate; strong for full-batch training.
3     SARPROP: an RPROP variant.
4     iRPROP+: improved resilient propagation. The robust recommended choice. No learning rate to tune, fast, and the strongest "set and forget" option for the usual full-batch training.
5     ARPROP: an RPROP variant; prefer iRPROP+.
6     Nesterov: SGD with look-ahead momentum. Uses the learning rate and momentum.
7     Adam: modern adaptive optimizer. Wants a small learning rate (~0.001–0.005).
8     AdamW: Adam with decoupled weight decay. The best modern choice when overfitting is a concern; pairs the adaptiveness of Adam with built-in regularization.
9     RMSProp: adaptive; uses the learning rate and a decay parameter.
10    AMSGrad: an Adam variant that never lets the second-moment estimate shrink, which steadies the late stages of training.
11    NAdam: Adam with Nesterov momentum; a competitive alternative.
12    Adagrad: adaptive; its per-weight rate decays over time and can stall on long runs. Avoid for long training.
13    Adadelta: self-scaling; needs no learning rate (leave it at 1.0).

If you never call this function, the optimizer defaults to Backpropagation (code 0), not the recommended iRPROP+. Codes 0–5 are the original algorithms; 6–13 are the modern additions.
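
Because the built-in default is Backpropagation, iRPROP+ has to be selected explicitly. A minimal sketch, using only functions that appear on this page; the epoch budget of 300 is an arbitrary illustrative value, and the rest of the network setup and the training call are assumed to exist elsewhere in the formula:

```afl
// Select iRPROP+ explicitly; without this call the trainer uses Backpropagation (code 0).
SetLearningAlgorithm( 4 );        // iRPROP+
SetMaximumEpochs( 300 );          // illustrative epoch budget, not a recommendation
// No SetLearningRate or SetMomentum call here: the RPROP family ignores the learning rate,
// and momentum applies only to Backpropagation and Nesterov.
```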

Learning rate and momentum

SetLearningRate(Rate)

The learning rate is the size of each weight-adjustment step. It is used by Backpropagation, Nesterov, the Adam family (Adam, AdamW, AMSGrad, NAdam), RMSProp, and Adagrad. The RPROP-family algorithms (RPROP, SARPROP, iRPROP+, ARPROP) ignore it entirely, and Adadelta uses it only as an overall multiplier (leave it at 1.0).

The right value depends on the algorithm. The Adam family, RMSProp, and Adagrad want a small rate, roughly 0.001–0.005. Plain SGD-style methods (Backpropagation, Nesterov) use a much larger rate such as 0.1. The default is 0.2, which is in the SGD-style range but well above the range recommended for the adaptive optimizers, so set the rate explicitly when you pick one of those. Valid values are greater than 0 and up to 20.

SetMomentum(Momentum)

Momentum smooths the weight updates and helps the optimizer push through flat regions. It applies to Backpropagation and Nesterov. Typical values are 0.7–0.9; valid values are 0 to 1. The default is 0.
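
As an illustration of the two settings used together, the sketch below configures Nesterov with an SGD-style rate and a momentum inside the typical range. The specific values (0.1, 0.9, 500 epochs) are illustrative picks from the ranges above rather than tuned recommendations:

```afl
SetLearningAlgorithm( 6 );        // Nesterov: SGD with look-ahead momentum
SetLearningRate( 0.1 );           // SGD-style rate, much larger than the ~0.001-0.005 used for Adam
SetMomentum( 0.9 );               // within the typical 0.7-0.9 range; the default of 0 means no momentum
SetMaximumEpochs( 500 );          // illustrative epoch budget
```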

Per-optimizer hyperparameters

The Adam family and a few others expose tuning knobs. The defaults are the standard published values and are sensible — most users only ever set the learning rate (and perhaps the AdamW weight decay). Out-of-range values are rejected and leave the setting unchanged.

Function             Range       Default  Applies to
SetAdamBeta1         0 < b < 1   0.9      Adam, AdamW, AMSGrad, NAdam. Decay rate of the first moment (the smoothed gradient).
SetAdamBeta2         0 < b < 1   0.999    Adam, AdamW, AMSGrad, NAdam. Decay rate of the second moment.
SetAdamEpsilon       e > 0       1e-8     Adam family, RMSProp, Adagrad. A tiny constant for numerical stability.
SetAdamWeightDecay   wd ≥ 0      0.01     AdamW only. Gently shrinks weights toward zero to reduce overfitting.
SetRMSPropDecay      0 < r < 1   0.9      RMSProp.
SetAdadeltaRho       0 < r < 1   0.95     Adadelta.
SetAdadeltaEpsilon   e > 0       1e-6     Adadelta.
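
For reference, a sketch that spells out the Adam-family knobs for AdamW. The beta and epsilon calls only restate the documented defaults, so they could be omitted; the weight decay of 0.05 is a hypothetical stronger-than-default value chosen purely for illustration:

```afl
SetLearningAlgorithm( 8 );        // AdamW
SetLearningRate( 0.003 );         // small rate, as the Adam family expects
SetAdamBeta1( 0.9 );              // first-moment decay (documented default)
SetAdamBeta2( 0.999 );            // second-moment decay (documented default)
SetAdamEpsilon( 0.00000001 );     // 1e-8, the documented default stability constant
SetAdamWeightDecay( 0.05 );       // hypothetical value; the default is 0.01

SetAdamBeta1( 1.5 );              // outside 0 < b < 1: per the rule above, rejected, so Beta1 stays at 0.9
```

If weight decay is the only setting being tuned, the three default-restating calls can be dropped.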

Learning-rate schedules

A schedule changes the learning rate as training progresses, rather than holding it fixed. This can help an adaptive optimizer settle into a better solution near the end of training. Schedules apply to all learning-rate-based optimizers (Backprop, Nesterov, the Adam family, RMSProp, Adagrad). The default reproduces the original fixed-rate behaviour.

SetLearningRateSchedule(Type)
Type  Schedule
0     Constant: fixed rate (the default).
1     Step decay: multiply the rate by a factor every N epochs.
2     Cosine annealing: smoothly lower the rate from the base value down to a floor over a set number of epochs. A good general choice.
3     SGDR: cosine annealing with warm restarts. The rate periodically jumps back up, helping the optimizer escape plateaus.

The schedule is shaped by these supporting functions:

Function                  Range       Default  Meaning
SetLRScheduleStep         ≥ 1         50       Step size (step decay) / length in epochs (cosine) / first cycle length (SGDR). For cosine, set this to your total epoch budget.
SetLRStepDecayFactor      0 < f ≤ 1   0.5      The multiplier applied at each step in step decay (0.5 halves the rate each step).
SetLRScheduleMinPercent   0 ≤ p < 1   0.0      The floor, as a fraction of the base rate, that cosine / SGDR anneal down to.
SetSGDRCycleMultiplier    ≥ 1         2.0      How much each SGDR cycle lengthens relative to the previous one.

SetLearningAlgorithm( 8 );        // AdamW
SetLearningRate( 0.003 );
SetLearningRateSchedule( 2 );     // cosine annealing
SetLRScheduleStep( 300 );         // anneal over the full 300-epoch budget
SetMaximumEpochs( 300 );
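
The other schedules are configured the same way. A sketch of SGDR, assuming the supporting functions behave as their descriptions say; the cycle arithmetic in the comments follows from those descriptions and has not been checked against the toolbox's implementation:

```afl
SetLearningAlgorithm( 7 );        // Adam
SetLearningRate( 0.003 );         // base rate that each cycle restarts from
SetLearningRateSchedule( 3 );     // SGDR: cosine annealing with warm restarts
SetLRScheduleStep( 50 );          // first cycle lasts 50 epochs
SetSGDRCycleMultiplier( 2.0 );    // each cycle twice as long as the last: 50, 100, 200 epochs
SetLRScheduleMinPercent( 0.1 );   // floor at 10% of the base rate, i.e. 0.0003
SetMaximumEpochs( 350 );          // 50 + 100 + 200 = 350, so training stops as the third cycle ends
```

Step decay would instead use SetLearningRateSchedule( 1 ) together with SetLRScheduleStep and SetLRStepDecayFactor. With a base rate of 0.1, a step of 50, and a factor of 0.5, the rate would be halved every 50 epochs: 0.1, then 0.05, then 0.025.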

Minibatch mode

By default the trainer is full-batch: it looks at all the training data and makes one weight update per epoch. Minibatch mode instead makes one update per small batch of data points, which is where the stochastic Adam-family optimizers earn their advantage on large datasets.

SetBatchSize(Size)

0 (the default) means full batch. A value greater than 0 sets the number of data points per minibatch — 32 to 128 is typical. Use minibatches only with a reasonably large training window (thousands of bars); for small windows, full-batch training with iRPROP+ is preferable.

EnableShuffleData()   DisableShuffleData()

With minibatches you should reshuffle the order of the data points each epoch so the batches differ from pass to pass. EnableShuffleData() turns this on (it uses a deterministic, reproducible shuffle); DisableShuffleData() turns it back off. Shuffling only matters in minibatch mode.

SetLearningAlgorithm( 8 );        // AdamW
SetLearningRate( 0.003 );
SetBatchSize( 64 );               // minibatches of 64
EnableShuffleData();
SetMaximumEpochs( 500 );
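
The advice about window size can be folded into the formula itself. In the sketch below, TrainingBars is a hypothetical variable holding the number of bars in the training window, and the 2000-bar cutoff is an assumed stand-in for "thousands of bars", not a value taken from the toolbox:

```afl
TrainingBars = 6400;              // hypothetical: size of the training window in bars

if( TrainingBars >= 2000 )        // assumed cutoff for a "reasonably large" window
{
    SetLearningAlgorithm( 8 );    // AdamW
    SetLearningRate( 0.003 );
    SetBatchSize( 64 );           // 6400 / 64 = 100 batches, so about 100 weight updates per epoch
    EnableShuffleData();          // reshuffle so the batches differ from epoch to epoch
}
else
{
    SetLearningAlgorithm( 4 );    // iRPROP+
    SetBatchSize( 0 );            // 0 = full batch, one weight update per epoch
    DisableShuffleData();         // shuffling only matters in minibatch mode
}
SetMaximumEpochs( 500 );          // illustrative epoch budget
```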

Which algorithm should I use?

  • Want it to just work with no tuning: SetLearningAlgorithm(4) — iRPROP+. Robust, ignores the learning rate, and the strongest all-round default for the usual full-batch training.
  • Worried about overfitting / want regularization: SetLearningAlgorithm(8) — AdamW, with a small learning rate (~0.003) and SetAdamWeightDecay, optionally with cosine annealing.
  • Large datasets with minibatches: AdamW (8) or Adam (7) plus SetBatchSize and EnableShuffleData().
  • Avoid: Adagrad (12) on long runs (its rate fades to zero and it stalls), plain Backprop (0), and ARPROP (5).

Note

In like-for-like testing on realistic market series, iRPROP+ and the Adam family all reached similar accuracy. The bigger wins came from the anti-overfitting settings, not the choice of optimizer.