Optimizers
40+ optimizers: every optimization algorithm you need for training neural networks
From classic SGD with momentum to cutting-edge optimizers like Lion and Sophia. All 40+ optimizers support learning rate scheduling, weight decay, gradient clipping, and mixed-precision training out of the box.
Adaptive Learning Rate
Optimizers that automatically adjust learning rates per parameter. A minimal update-rule sketch follows this group.
Adam
Adaptive Moment Estimation combining first and second moment estimates.
AdamW
Adam with decoupled weight decay regularization (the standard for transformers).
RAdam
Rectified Adam with variance-adaptive learning rate warmup.
NAdam
Nesterov-accelerated Adam combining Nesterov momentum with Adam.
Adan
Adaptive Nesterov Momentum for faster convergence.
Lion
Evolved Sign Momentum optimizer discovered by program search; more memory-efficient than Adam.
Prodigy
Automatic learning rate estimation requiring zero LR tuning.
Sophia
Second-order optimizer using diagonal Hessian for faster LLM training.
CAME
Confidence-guided Adaptive Memory Efficient optimizer for LLMs.
AdaBelief
Adapting step sizes by the belief in observed gradients.
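To make the mechanics behind this group concrete, here is a minimal standalone sketch of one AdamW step, the decoupled weight decay variant noted above. It is illustrative only and independent of the AiDotNet optimizer classes; the AdamWStep name, buffer layout, and default hyperparameters are our assumptions, not the library's API.
using System;
// One AdamW step over a flat parameter array (illustrative sketch, not AiDotNet code).
// m and v are per-parameter first/second moment buffers carried between calls; step starts at 1.
static void AdamWStep(
    float[] param, float[] grad, float[] m, float[] v, int step,
    float lr = 1e-3f, float beta1 = 0.9f, float beta2 = 0.999f,
    float eps = 1e-8f, float weightDecay = 0.01f)
{
    for (int i = 0; i < param.Length; i++)
    {
        // Exponential moving averages of the gradient and squared gradient.
        m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
        v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
        // Bias correction for the early steps.
        float mHat = m[i] / (1 - MathF.Pow(beta1, step));
        float vHat = v[i] / (1 - MathF.Pow(beta2, step));
        // Decoupled weight decay (AdamW): applied directly to the weights,
        // not folded into the gradient as in classic L2 regularization.
        param[i] -= lr * (mHat / (MathF.Sqrt(vHat) + eps) + weightDecay * param[i]);
    }
}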
Large-Batch Optimizers
Designed for distributed training with large batch sizes. A sketch of the shared layer-wise trust ratio follows this group.
LAMB
Layer-wise Adaptive Moments for scaling BERT training to batch size 65K.
LARS
Layer-wise Adaptive Rate Scaling for large-batch SGD training.
NovoGrad
Stochastic gradient descent with layer-wise gradient normalization.
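The common thread in this group is a per-layer trust ratio: each layer's update is rescaled by the ratio of the layer's weight norm to its update norm, which keeps large-batch steps stable across layers. Below is a minimal sketch of that idea in the LARS style, with momentum (and LAMB's Adam-style moments) omitted for brevity; LarsStep and Norm are hypothetical helper names, not AiDotNet APIs.
using System;
// Layer-wise trust-ratio update for one layer's weights (illustrative sketch, not AiDotNet code).
static void LarsStep(float[] weights, float[] grad, float lr,
    float weightDecay = 1e-4f, float eps = 1e-8f)
{
    // Fold L2 regularization into the raw update, as in the LARS paper.
    var update = new float[weights.Length];
    for (int i = 0; i < weights.Length; i++)
        update[i] = grad[i] + weightDecay * weights[i];

    float weightNorm = Norm(weights);
    float updateNorm = Norm(update);

    // Trust ratio ||w|| / ||g + wd*w||, computed separately for every layer.
    float trustRatio = (weightNorm > 0f && updateNorm > 0f)
        ? weightNorm / (updateNorm + eps)
        : 1.0f;

    for (int i = 0; i < weights.Length; i++)
        weights[i] -= lr * trustRatio * update[i];
}

static float Norm(float[] x)
{
    float sum = 0f;
    foreach (var xi in x) sum += xi * xi;
    return MathF.Sqrt(sum);
}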
Memory-Efficient
Optimizers that reduce memory footprint for training large models. A sketch of Adafactor-style factored statistics follows this group.
Adafactor
Row/column factored second moments reducing memory from O(mn) to O(m+n).
SM3
Memory-efficient adaptive optimization with cover-based second moments.
8-bit Adam
Quantized optimizer states for roughly 75% less optimizer-state memory.
GaLore
Gradient low-rank projection for full-rank training at LoRA-level memory.
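As one example of how these savings are achieved, the sketch below shows Adafactor-style factored second moments for an m x n weight matrix: only a length-m row vector and a length-n column vector are stored, and the per-element statistic is reconstructed from them, shrinking the buffer from O(mn) to O(m+n). Illustrative only; the helper names are our assumptions, not the library's API.
// Maintain row/column averages of the squared gradients (illustrative sketch, not AiDotNet code).
static void UpdateFactoredSecondMoment(
    float[,] gradSquared, float[] rowAvg, float[] colAvg, float beta2 = 0.999f)
{
    int rows = gradSquared.GetLength(0);
    int cols = gradSquared.GetLength(1);
    for (int i = 0; i < rows; i++)
    {
        float rowMean = 0f;
        for (int j = 0; j < cols; j++) rowMean += gradSquared[i, j];
        rowAvg[i] = beta2 * rowAvg[i] + (1 - beta2) * (rowMean / cols);
    }
    for (int j = 0; j < cols; j++)
    {
        float colMean = 0f;
        for (int i = 0; i < rows; i++) colMean += gradSquared[i, j];
        colAvg[j] = beta2 * colAvg[j] + (1 - beta2) * (colMean / rows);
    }
}

// Reconstruct the second-moment estimate for element (i, j) from the two factors.
static float SecondMomentAt(float[] rowAvg, float[] colAvg, int i, int j)
{
    float rowSum = 0f;
    foreach (var r in rowAvg) rowSum += r;
    return rowAvg[i] * colAvg[j] / (rowSum / rowAvg.Length);
}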
Classic Optimizers
Foundational optimization algorithms with proven convergence. A momentum update sketch follows this group.
SGD + Momentum
Stochastic gradient descent with momentum for accelerated convergence.
Nesterov
Nesterov accelerated gradient with look-ahead momentum.
RMSProp
Root mean square propagation with per-parameter learning rates.
Rprop
Resilient backpropagation using only gradient signs.
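For reference, here is a minimal sketch of SGD with classical and Nesterov momentum, using the common framework-style formulation of the Nesterov look-ahead. Illustrative only; SgdMomentumStep and its defaults are our assumptions, not the AiDotNet API.
// One SGD step with an optional Nesterov look-ahead (illustrative sketch, not AiDotNet code).
static void SgdMomentumStep(
    float[] param, float[] grad, float[] velocity,
    float lr = 0.1f, float momentum = 0.9f, bool nesterov = false)
{
    for (int i = 0; i < param.Length; i++)
    {
        // Exponentially decaying accumulation of past gradients.
        velocity[i] = momentum * velocity[i] + grad[i];
        // Nesterov: take the step as if the momentum had already been applied once more.
        float step = nesterov ? grad[i] + momentum * velocity[i] : velocity[i];
        param[i] -= lr * step;
    }
}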
Optimizers with AiModelBuilder
using AiDotNet;
// Train with a specific optimizer using AiModelBuilder
var result = await new AiModelBuilder<float, float[], float>()
.ConfigureModel(new NeuralNetwork<float>(
inputSize: 784, hiddenSize: 128, outputSize: 10))
.ConfigureOptimizer(new AdamWOptimizer<float>(
learningRate: 1e-3f, weightDecay: 0.01f))
.ConfigurePreprocessing()
.BuildAsync(features, labels);
var prediction = result.Predict(newSample);
Start building with Optimizers
All 40+ implementations are included free under Apache 2.0.