# Optimizers

Complete reference for the 42+ optimization algorithms in AiDotNet.



## First-Order Optimizers

### SGD Family

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `SGDOptimizer<T>` | Stochastic Gradient Descent | General purpose |
| `MomentumOptimizer<T>` | SGD with momentum | Faster convergence |
| `NesterovOptimizer<T>` | Nesterov Accelerated Gradient | Look-ahead momentum |

```csharp
var optimizer = new SGDOptimizer<float>(
    learningRate: 0.01f,
    momentum: 0.9f,
    weightDecay: 1e-4f,
    nesterov: true);
```

### Adam Family

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `AdamOptimizer<T>` | Adaptive Moment Estimation | General deep learning |
| `AdamWOptimizer<T>` | Adam with decoupled weight decay | Transformers, large models |
| `AdaMaxOptimizer<T>` | Adam with infinity norm | Sparse gradients |
| `NAdamOptimizer<T>` | Nesterov Adam | Improved convergence |
| `RAdam<T>` | Rectified Adam | Stable training |
| `AdamP<T>` | Adam with projection | Vision models |

```csharp
var optimizer = new AdamWOptimizer<float>(
    learningRate: 3e-4f,
    beta1: 0.9f,
    beta2: 0.999f,
    epsilon: 1e-8f,
    weightDecay: 0.01f);
```

### Adaptive Learning Rate

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `AdaGradOptimizer<T>` | Adaptive gradient | Sparse features |
| `AdaDeltaOptimizer<T>` | Adaptive delta | No learning rate tuning |
| `RMSpropOptimizer<T>` | Root Mean Square propagation | RNNs |
| `AdaFactorOptimizer<T>` | Memory-efficient adaptive | Large models |
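For example, RMSprop for an RNN might be configured as below. This is a sketch: the `alpha`, `epsilon`, and `momentum` parameter names are assumptions, not confirmed AiDotNet signatures.

```csharp
// Sketch only: parameter names below are assumed, not confirmed API.
var optimizer = new RMSpropOptimizer<float>(
    learningRate: 1e-3f,  // common starting point for RMSprop
    alpha: 0.99f,         // smoothing constant for the squared-gradient average
    epsilon: 1e-8f,       // numerical stability term
    momentum: 0.9f);      // optional momentum on top of the adaptive update
```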

### LAMB/LARS

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `LARSOptimizer<T>` | Layer-wise Adaptive Rate Scaling | Large batch training |
| `LAMBOptimizer<T>` | Layer-wise Adaptive Moments | BERT pre-training |

```csharp
var optimizer = new LAMBOptimizer<float>(
    learningRate: 0.001f,
    beta1: 0.9f,
    beta2: 0.999f,
    trustCoeff: 0.001f);
```

### Modern Optimizers

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `LionOptimizer<T>` | Evolved Sign Momentum | Vision, language models |
| `Prodigy<T>` | Automatic learning rate | No tuning needed |
| `ScheduleFree<T>` | No schedule needed | Simplified training |
| `Sophia<T>` | Second-order information | LLM training |
| `Muon<T>` | Momentum-based | Research |

```csharp
var optimizer = new LionOptimizer<float>(
    learningRate: 1e-4f,
    beta1: 0.9f,
    beta2: 0.99f,
    weightDecay: 0.0f);
```

## Second-Order Optimizers

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `LBFGSOptimizer<T>` | Limited-memory BFGS | Small models, full batch |
| `NewtonOptimizer<T>` | Newton's method | Convex optimization |
| `KFACOptimizer<T>` | Kronecker-factored curvature | Deep networks |
| `ShampooOptimizer<T>` | Preconditioning | Large-scale training |

```csharp
var optimizer = new LBFGSOptimizer<double>(
    maxIterations: 20,
    historySize: 10,
    lineSearch: LineSearch.StrongWolfe);
```

## Sparse Optimizers

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `SparseAdamOptimizer<T>` | Adam for sparse gradients | Embeddings |
| `LazyAdamOptimizer<T>` | Lazy parameter updates | Large sparse models |
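A minimal sketch for training embeddings, assuming `SparseAdamOptimizer<T>` mirrors Adam's hyperparameters (the constructor below is an assumption, not a confirmed signature):

```csharp
// Sketch: assumed Adam-style constructor. SparseAdam conventionally applies
// moment updates only to parameters whose gradients are non-zero, which keeps
// large embedding tables cheap to train.
var optimizer = new SparseAdamOptimizer<float>(
    learningRate: 1e-3f,
    beta1: 0.9f,
    beta2: 0.999f,
    epsilon: 1e-8f);
```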

## Evolutionary Optimizers

| Optimizer | Description | Use Case |
|-----------|-------------|----------|
| `GeneticOptimizer<T>` | Genetic Algorithm | Hyperparameter search |
| `EvolutionStrategy<T>` | Evolution strategies | Neural architecture |
| `CMAESOptimizer<T>` | Covariance Matrix Adaptation | Black-box optimization |
| `ParticleSwarmOptimizer<T>` | Particle Swarm | Global optimization |
| `DifferentialEvolution<T>` | Differential evolution | Continuous optimization |

```csharp
var optimizer = new GeneticOptimizer<double>(
    populationSize: 100,
    mutationRate: 0.1,
    crossoverRate: 0.8,
    elitismCount: 5);
```

## Learning Rate Schedulers

### Step-Based

| Scheduler | Description |
|-----------|-------------|
| `StepLR` | Decay by factor at fixed intervals |
| `MultiStepLR` | Decay at specified milestones |
| `ExponentialLR` | Exponential decay |

```csharp
var scheduler = new StepLR(
    optimizer: optimizer,
    stepSize: 30,
    gamma: 0.1f);
```

### Epoch-Based

| Scheduler | Description |
|-----------|-------------|
| `CosineAnnealingLR` | Cosine annealing |
| `CosineAnnealingWarmRestarts` | Cosine with restarts |
| `LinearLR` | Linear decay |
| `PolynomialLR` | Polynomial decay |

```csharp
var scheduler = new CosineAnnealingLR(
    optimizer: optimizer,
    tMax: 100,
    etaMin: 1e-6f);
```

### Warmup Schedulers

| Scheduler | Description |
|-----------|-------------|
| `WarmupLinearSchedule` | Linear warmup |
| `WarmupCosineSchedule` | Warmup + cosine decay |
| `WarmupConstantSchedule` | Warmup + constant |

```csharp
var scheduler = new WarmupCosineSchedule(
    optimizer: optimizer,
    warmupSteps: 1000,
    totalSteps: 100000);
```
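Since `WarmupCosineSchedule` is parameterized in optimizer steps (`warmupSteps`, `totalSteps`), it is typically stepped once per batch rather than once per epoch. A sketch:

```csharp
foreach (var batch in dataLoader)
{
    // ... forward pass, backward pass, optimizer.Step() ...
    scheduler.Step();  // advance the warmup/decay schedule every optimizer step
}
```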

### Adaptive Schedulers

| Scheduler | Description |
|-----------|-------------|
| `ReduceLROnPlateau` | Reduce when metric plateaus |
| `CyclicLR` | Cyclic learning rate |
| `OneCycleLR` | One cycle policy |

```csharp
var scheduler = new ReduceLROnPlateau(
    optimizer: optimizer,
    mode: "min",
    factor: 0.1f,
    patience: 10);
```
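`ReduceLROnPlateau` reacts to a monitored metric, so the metric value is typically passed when stepping. A sketch (the `Step(metric)` overload and the helper methods are assumptions for illustration):

```csharp
for (int epoch = 0; epoch < epochs; epoch++)
{
    TrainOneEpoch(model, optimizer);  // hypothetical helper for brevity
    float valLoss = Validate(model);  // hypothetical helper returning validation loss
    scheduler.Step(valLoss);          // LR drops by `factor` after `patience` epochs without improvement
}
```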

## Usage Examples

### Basic Training Loop

```csharp
var optimizer = new AdamWOptimizer<float>(learningRate: 3e-4f);
var scheduler = new CosineAnnealingLR(optimizer, tMax: epochs);

for (int epoch = 0; epoch < epochs; epoch++)
{
    foreach (var batch in dataLoader)
    {
        optimizer.ZeroGrad();                     // clear gradients from the previous step
        var output = model.Forward(batch.Input);  // forward pass returns predictions, not loss
        var loss = lossFunction.Compute(output, batch.Target); // schematic: substitute your configured loss
        loss.Backward();                          // backpropagate through the graph
        optimizer.Step();                         // apply the parameter update
    }
    scheduler.Step();                             // advance the cosine schedule once per epoch
}
```

### With AiModelBuilder

```csharp
var result = await new AiModelBuilder<float, Tensor<float>, int>()
    .ConfigureModel(model)
    .ConfigureOptimizer(new AdamWOptimizer<float>(learningRate: 3e-4f))
    .ConfigureLearningRateScheduler(new CosineAnnealingLR(tMax: 100))
    .BuildAsync(trainData, trainLabels);
```

### Gradient Clipping

```csharp
var optimizer = new AdamWOptimizer<float>(
    learningRate: 3e-4f,
    maxGradNorm: 1.0f);  // Gradient clipping
```

## Optimizer Selection Guide

| Task | Recommended Optimizer |
|------|-----------------------|
| General deep learning | AdamW |
| Transformers/LLMs | AdamW + cosine schedule |
| Large batch training | LAMB |
| Vision models | SGD + momentum or Lion |
| RNNs | RMSprop or Adam |
| Small datasets | L-BFGS |
| Hyperparameter search | Genetic/PSO |
| Memory-constrained | AdaFactor |
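For instance, the transformer/LLM recommendation combines constructors already shown on this page. A sketch wiring them together:

```csharp
// AdamW with decoupled weight decay, paired with warmup + cosine decay,
// using the constructors from the sections above.
var optimizer = new AdamWOptimizer<float>(
    learningRate: 3e-4f,
    weightDecay: 0.01f);

var scheduler = new WarmupCosineSchedule(
    optimizer: optimizer,
    warmupSteps: 1000,    // linear ramp-up stabilizes early training
    totalSteps: 100000);  // cosine decay over the remaining steps
```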

## Hyperparameter Guidelines

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| Learning rate | 1e-5 to 1e-2 | Start with 3e-4 for Adam |
| Weight decay | 0 to 0.1 | 0.01 is common |
| Beta1 (momentum) | 0.9 to 0.95 | 0.9 is standard |
| Beta2 | 0.99 to 0.999 | 0.999 for Adam |
| Epsilon | 1e-8 to 1e-6 | 1e-8 is standard |