Optimizers

40+

Every optimization algorithm you need for training neural networks

From classic SGD with momentum to cutting-edge optimizers like Lion and Sophia. All 40+ optimizers support learning rate scheduling, weight decay, gradient clipping, and mixed-precision training out of the box.

Neural Network Training · Fine-Tuning · Large-Scale Training · Edge Deployment · Research · AutoML

Adaptive Learning Rate

Optimizers that automatically adjust learning rates per-parameter.

Adam

Adaptive Moment Estimation using exponential moving averages of the gradient (first moment) and squared gradient (second moment).

AdamW

Adam with decoupled weight decay regularization (the standard for transformers); see the update-rule sketch after this list.

RAdam

Rectified Adam, which corrects the variance of the adaptive learning rate in the early steps, acting as a built-in warmup.

NAdam

Nesterov-accelerated Adam: Adam with Nesterov momentum applied to the first-moment update.

Adan

Adaptive Nesterov Momentum for faster convergence.

Lion

Evolved Sign Momentum optimizer discovered by program search; more memory efficient than Adam because it keeps only a single momentum buffer.

Prodigy

Automatic learning rate estimation requiring zero LR tuning.

Sophia

Second-order optimizer using a lightweight diagonal Hessian estimate for faster LLM training.

CAME

Confidence-guided Adaptive Memory Efficient optimizer for LLMs.

AdaBelief

Adapting step sizes by the belief in observed gradients.
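
AdamW's decoupled weight decay comes down to a single line in the update: the decay is applied to the weight itself instead of being folded into the gradient. The sketch below illustrates that update rule only; it is not AiDotNet's internal code, and the method name, signature, and default hyperparameters are invented for the example.

C#
using System;

// Minimal AdamW step for one parameter array (illustration only).
static void AdamWStep(
    float[] param, float[] grad, float[] m, float[] v, int t,
    float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
    float eps = 1e-8f, float weightDecay = 0.01f)
{
    for (int i = 0; i < param.Length; i++)
    {
        // Exponential moving averages of the gradient and squared gradient.
        m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];

        // Bias correction for the zero-initialized moments (t is the step count, starting at 1).
        float mHat = m[i] / (1 - MathF.Pow(beta1, t));
        float vHat = v[i] / (1 - MathF.Pow(beta2, t));

        // Decoupled weight decay: applied directly to the weight rather than added to the
        // gradient, which is what separates AdamW from Adam with L2 regularization.
        param[i] -= lr * (mHat / (MathF.Sqrt(vHat) + eps) + weightDecay * param[i]);
    }
}

Drop the weightDecay term and this is plain Adam; Lion is in the same family but keeps a single momentum buffer and steps along the sign of a momentum/gradient interpolation, which is where its memory saving comes from.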

Large-Batch Optimizers

Designed for distributed training with large batch sizes.

LAMB

Layer-wise Adaptive Moments for scaling BERT pre-training to a batch size of 64K.

LARS

Layer-wise Adaptive Rate Scaling for large-batch SGD training; see the trust-ratio sketch after this list.

NovoGrad

Stochastic gradient descent with layer-wise gradient normalization.
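
What LAMB, LARS, and NovoGrad share is per-layer rescaling of the step. The helper below illustrates the layer-wise trust ratio only; it is not AiDotNet's implementation, and the function name, the eta default, and the zero-norm fallback are assumptions made for the sketch.

C#
using System;
using System.Linq;

// Layer-wise trust ratio used by LARS (and, on top of Adam, by LAMB). Each layer's
// step is rescaled so the update norm stays proportional to the weight norm, which
// keeps very large-batch training stable across layers of very different scale.
static float TrustRatio(float[] layerWeights, float[] layerUpdate, float eta = 1e-3f)
{
    float wNorm = MathF.Sqrt(layerWeights.Sum(w => w * w));
    float uNorm = MathF.Sqrt(layerUpdate.Sum(u => u * u));

    // Fall back to 1 when either norm is zero (e.g., a freshly zero-initialized bias).
    return (wNorm > 0f && uNorm > 0f) ? eta * wNorm / uNorm : 1f;
}

// LARS applies this ratio to the momentum-smoothed gradient; LAMB applies it to the
// bias-corrected Adam direction, with weight decay added before scaling.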

Memory-Efficient

Optimizers that reduce memory footprint for training large models.

Adafactor

Row/column factored second moments reducing memory from O(mn) to O(m+n); see the factoring sketch after this list.

SM3

Memory-efficient adaptive optimization with cover-based second moments.

8-bit Adam

Quantized (8-bit) optimizer states cutting optimizer memory by roughly 75% versus 32-bit Adam.

GaLore Optimizer

Gradient low-rank projection enabling full-parameter training at LoRA-level memory cost.
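
Adafactor's O(mn) to O(m+n) claim works because the matrix of squared-gradient statistics is never stored, only its running row and column sums. The sketch below illustrates that factoring only and is not AiDotNet's implementation; the names and the beta2 default are invented for the example, and a real implementation would never materialize vHat but divide elementwise during the update.

C#
using System.Linq;

// Adafactor-style factored second moment for an m-by-n weight matrix (illustration only).
// State kept between steps: rowAvg (length m) and colAvg (length n) instead of an m-by-n matrix.
static float[,] FactoredSecondMoment(float[,] grad, float[] rowAvg, float[] colAvg, float beta2 = 0.999f)
{
    int rows = grad.GetLength(0), cols = grad.GetLength(1);

    // Running averages of per-row and per-column sums of squared gradients: O(m + n) state.
    for (int i = 0; i < rows; i++)
    {
        float rowSum = 0f;
        for (int j = 0; j < cols; j++) rowSum += grad[i, j] * grad[i, j];
        rowAvg[i] = beta2 * rowAvg[i] + (1 - beta2) * rowSum;
    }
    for (int j = 0; j < cols; j++)
    {
        float colSum = 0f;
        for (int i = 0; i < rows; i++) colSum += grad[i, j] * grad[i, j];
        colAvg[j] = beta2 * colAvg[j] + (1 - beta2) * colSum;
    }

    // Rank-1 reconstruction: vHat[i, j] ≈ rowAvg[i] * colAvg[j] / sum(rowAvg).
    // Materialized here for clarity only; in practice it is computed elementwise during the update.
    float rowTotal = rowAvg.Sum() + 1e-30f;
    var vHat = new float[rows, cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            vHat[i, j] = rowAvg[i] * colAvg[j] / rowTotal;
    return vHat;
}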

Classic Optimizers

Foundational optimization algorithms with proven convergence.

SGD + Momentum

Stochastic gradient descent with momentum for accelerated convergence; see the momentum sketch after this list.

Nesterov

Nesterov accelerated gradient with look-ahead momentum.

RMSProp

Root mean square propagation with per-parameter learning rates.

Rprop

Resilient backpropagation, which adapts per-parameter step sizes using only the sign of the gradient.
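
The difference between classic momentum and Nesterov is where the gradient is evaluated. The sketch below shows the common reformulation used by most frameworks; it is an illustration only, not AiDotNet's implementation, and the method name and defaults are invented for the example.

C#
// SGD with heavy-ball or Nesterov momentum for one parameter array (illustration only).
static void MomentumStep(float[] w, float[] grad, float[] velocity,
    float lr = 0.01f, float momentum = 0.9f, bool nesterov = false)
{
    for (int i = 0; i < w.Length; i++)
    {
        // Accumulate an exponentially decaying sum of past gradients.
        velocity[i] = momentum * velocity[i] + grad[i];

        // Heavy-ball momentum steps along the velocity; Nesterov adds one extra
        // momentum * velocity term, equivalent to taking the gradient at the
        // looked-ahead position.
        float step = nesterov ? grad[i] + momentum * velocity[i] : velocity[i];
        w[i] -= lr * step;
    }
}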

Optimizers with AiModelBuilder

C#
using AiDotNet;

// Train with a specific optimizer using AiModelBuilder
var result = await new AiModelBuilder<float, float[], float>()
    .ConfigureModel(new NeuralNetwork<float>(
        inputSize: 784, hiddenSize: 128, outputSize: 10))
    .ConfigureOptimizer(new AdamWOptimizer<float>(
        learningRate: 1e-3f, weightDecay: 0.01f))
    .ConfigurePreprocessing()
    .BuildAsync(features, labels);

var prediction = result.Predict(newSample);

Start building with Optimizers

All 40+ implementations are included free under the Apache 2.0 license.