Distributed Strategies

Complete reference for all 10+ distributed training strategies in AiDotNet.



Overview

AiDotNet supports multiple distributed training strategies to scale training across multiple GPUs and nodes:

| Strategy | Memory per GPU | Communication | Best For |
|----------|----------------|---------------|----------|
| DDP | Full model | Gradient sync | Models that fit in GPU memory |
| FSDP | Sharded model | All-gather | Large models |
| ZeRO-1 | Sharded optimizer | Optimizer sync | Medium models |
| ZeRO-2 | + Sharded gradients | Gradient sync | Large models |
| ZeRO-3 | + Sharded params | All-gather | Very large models |
| Pipeline | Split by layers | Forward/backward | Deep models |
| Tensor | Split operations | All-reduce | Wide models |

DDP (Distributed Data Parallel)

Replicates the model on each GPU and synchronizes gradients.

Configuration

```csharp
using AiDotNet.DistributedTraining;

var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = 4  // Number of GPUs
};

using var context = DistributedContext.Initialize(config);

// Wrap the model with DDP so gradients are synchronized across ranks
var ddpModel = DDP.Wrap(model);
```

Training Loop

```csharp
for (int epoch = 0; epoch < epochs; epoch++)
{
    foreach (var batch in dataLoader)
    {
        var output = ddpModel.Forward(batch.Input);
        var loss = lossFunction.Compute(output, batch.Target);

        ddpModel.Backward(loss);
        optimizer.Step();
        optimizer.ZeroGrad();
    }
}
```
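Under the hood, DDP averages each gradient across ranks with an all-reduce before the optimizer step, so every replica applies the same update. A minimal Python sketch of that synchronization, simulating four ranks in-process (no real communication backend, purely illustrative):

```python
# Simulate DDP gradient synchronization: each "rank" computes a
# different local gradient, and an all-reduce replaces every local
# gradient with the average across ranks.

def all_reduce_mean(local_grads):
    """Average a list of per-rank gradients (one flat list per rank)."""
    world_size = len(local_grads)
    n = len(local_grads[0])
    avg = [sum(g[i] for g in local_grads) / world_size for i in range(n)]
    # After the all-reduce, every rank holds the same averaged gradient
    return [avg[:] for _ in range(world_size)]

# Four ranks, each with a different local gradient for 3 parameters
local = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0],
         [0.0, 0.0, 0.0],
         [4.0, 4.0, 4.0]]

synced = all_reduce_mean(local)
```

Because every rank ends up with the identical averaged gradient, identically initialized replicas stay in lockstep after each optimizer step.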

Multi-Node Setup

```bash
# Node 0 (hosts ranks 0-3)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0
export LOCAL_RANK=0
dotnet run
```

```bash
# Node 1 (hosts ranks 4-7)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=4
export LOCAL_RANK=0
dotnet run
```

Memory Usage

| Model Size | DDP Memory per GPU |
|------------|--------------------|
| 1B | 4 GB |
| 7B | 28 GB |
| 13B | 52 GB (may OOM) |
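The numbers above are weights-only estimates: fp32 parameters at 4 bytes each, before gradients, optimizer state, and activations are counted. A quick sanity check (a hypothetical helper, not an AiDotNet API):

```python
def ddp_weight_memory_gb(n_params: float) -> float:
    """Weights-only memory for fp32 parameters (4 bytes each).

    DDP replicates the full model, so every GPU pays this cost;
    gradients, optimizer state, and activations come on top.
    """
    return n_params * 4 / 1e9

for n in (1e9, 7e9, 13e9):
    print(f"{n/1e9:.0f}B params -> {ddp_weight_memory_gb(n):.0f} GB per GPU")
```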

FSDP (Fully Sharded Data Parallel)

Shards model parameters across GPUs for memory efficiency.

Configuration

```csharp
using AiDotNet.DistributedTraining.FSDP;

var fsdpConfig = new FSDPConfig<float>
{
    ShardingStrategy = ShardingStrategy.FullShard,
    MixedPrecision = new FSDPMixedPrecisionConfig
    {
        Enabled = true,
        ParameterDtype = DataType.Float32,
        ReduceDtype = DataType.Float32,
        BufferDtype = DataType.BFloat16
    },
    ActivationCheckpointing = new ActivationCheckpointingConfig
    {
        Enabled = true,
        CheckpointInterval = 2
    }
};

var fsdpModel = FSDP<float>.Wrap(model, fsdpConfig);
```

Sharding Strategies

| Strategy | Description | Memory | Speed |
|----------|-------------|--------|-------|
| NoShard | No sharding (like DDP) | High | Fast |
| ShardGradOp | Shard gradients + optimizer | Medium | Medium |
| FullShard | Shard everything | Low | Slower |
| HybridShard | Full shard within node | Balanced | Balanced |

Memory Comparison

| Strategy | 7B Model Memory/GPU (4 GPUs) |
|----------|------------------------------|
| DDP | 28+ GB (OOM) |
| SHARD_GRAD_OP | ~14 GB |
| FULL_SHARD | ~8 GB |
| FULL_SHARD + Checkpointing | ~5 GB |

Wrapping Policies

```csharp
// Auto-wrap transformer layers
var fsdpConfig = new FSDPConfig<float>
{
    AutoWrapPolicy = new TransformerAutoWrapPolicy
    {
        TransformerLayerClass = typeof(TransformerBlock<float>)
    }
};
```

```csharp
// Size-based wrapping
var fsdpConfig = new FSDPConfig<float>
{
    AutoWrapPolicy = new SizeBasedAutoWrapPolicy
    {
        MinNumParams = 100_000_000  // 100M params
    }
};
```

ZeRO Optimization

DeepSpeed-style memory optimization.

ZeRO Stage 1 (Optimizer State Partitioning)

```csharp
using AiDotNet.DistributedTraining.ZeRO;

var zero1 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage1);
```

ZeRO Stage 2 (+ Gradient Partitioning)

```csharp
var zero2 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage2);
```

ZeRO Stage 3 (+ Parameter Partitioning)

```csharp
var zero3 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage3);
```

Memory Reduction

| Stage | Optimizer States | Gradients | Parameters | Memory Reduction |
|-------|------------------|-----------|------------|------------------|
| Stage 1 | Sharded | Replicated | Replicated | ~4x |
| Stage 2 | Sharded | Sharded | Replicated | ~8x |
| Stage 3 | Sharded | Sharded | Sharded | ~Linear with GPUs |
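These reduction factors follow from the per-parameter byte counts in the ZeRO paper: with mixed-precision Adam, each parameter costs 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, and variance). A sketch of the per-GPU model-state memory at each stage (illustrative arithmetic, not an AiDotNet API):

```python
def zero_memory_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory under ZeRO, in GB.

    Mixed-precision Adam byte counts per parameter:
      2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 master copy,
      momentum, variance) = 16 bytes replicated without ZeRO.
    """
    params = 2 * n_params  # fp16 parameters
    grads = 2 * n_params   # fp16 gradients
    opt = 12 * n_params    # fp32 optimizer states
    if stage == 0:
        total = params + grads + opt            # everything replicated
    elif stage == 1:
        total = params + grads + opt / n_gpus   # shard optimizer states
    elif stage == 2:
        total = params + (grads + opt) / n_gpus # + shard gradients
    else:
        total = (params + grads + opt) / n_gpus # + shard parameters
    return total / 1e9
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120 GB with no sharding, 31 GB at Stage 1, 17 GB at Stage 2, and under 2 GB at Stage 3, matching the ~4x / ~8x / linear reductions in the table.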

Pipeline Parallelism

Splits model across GPUs by layers.

Configuration

```csharp
using AiDotNet.DistributedTraining.Pipeline;

var pipelineConfig = new PipelineConfig
{
    NumStages = 4,
    MicroBatchSize = 4,
    NumMicroBatches = 8
};

var stages = new[]
{
    new PipelineStage(layers: model.Layers[..6], device: 0),
    new PipelineStage(layers: model.Layers[6..12], device: 1),
    new PipelineStage(layers: model.Layers[12..18], device: 2),
    new PipelineStage(layers: model.Layers[18..], device: 3)
};

var pipelineModel = Pipeline.Wrap(model, stages, pipelineConfig);
```

Scheduling

| Schedule | Description | Bubble Overhead |
|----------|-------------|-----------------|
| GPipe | Simple forward-backward | High |
| 1F1B | Interleaved forward/backward | Lower |
| Interleaved 1F1B | Multiple micro-batches | Lowest |
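The bubble overhead can be quantified: with p pipeline stages and m micro-batches, GPipe's idle fraction is (p - 1) / (m + p - 1), so adding micro-batches shrinks the bubble. A quick illustration (hypothetical helper, not part of the library):

```python
def gpipe_bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of time pipeline stages sit idle under GPipe scheduling."""
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)

# The configuration above: 4 stages, 8 micro-batches
print(gpipe_bubble_fraction(4, 8))   # ~0.27: stages idle about 27% of the time
print(gpipe_bubble_fraction(4, 32))  # ~0.09: more micro-batches shrink the bubble
```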

Tensor Parallelism

Splits individual operations across GPUs.

Configuration

```csharp
using AiDotNet.DistributedTraining.TensorParallel;

var tpConfig = new TensorParallelConfig
{
    WorldSize = 8,
    ParallelMode = ParallelMode.ColumnParallel
};

// Parallel linear layer
var parallelLinear = new ColumnParallelLinear<float>(
    inputDim: 4096,
    outputDim: 16384,
    config: tpConfig);
```

Parallel Modes

| Mode | Splits | Best For |
|------|--------|----------|
| ColumnParallel | Output features | Linear layers |
| RowParallel | Input features | After column parallel |
| SequenceParallel | Sequence dimension | Attention |
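Column parallelism splits the weight matrix by output features: each GPU holds a slice of the columns, computes its partial output, and an all-gather concatenates the slices. A dependency-free Python sketch with two simulated ranks:

```python
def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0]]                # batch of 1, input dim 2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]      # input dim 2, output dim 4

# Column parallel: rank 0 owns output columns 0-1, rank 1 owns 2-3
w_rank0 = [row[:2] for row in w]
w_rank1 = [row[2:] for row in w]

y_rank0 = matmul(x, w_rank0)    # partial output computed on rank 0
y_rank1 = matmul(x, w_rank1)    # partial output computed on rank 1

# "All-gather": concatenate partial outputs along the feature dimension
y = [y_rank0[0] + y_rank1[0]]
assert y == matmul(x, w)        # identical to the unsharded layer
```

Row parallelism is the mirror image: it splits the input dimension, so each rank produces a partial sum of the full output and an all-reduce adds the partials, which is why it composes naturally after a column-parallel layer.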

Using AiModelBuilder

```csharp
var result = await new AiModelBuilder<float, Tensor<float>, Tensor<float>>()
    .ConfigureModel(largeModel)
    .ConfigureOptimizer(new AdamWOptimizer<float>())
    .ConfigureDistributedTraining(new DistributedConfig
    {
        Strategy = DistributedStrategy.FSDP,
        WorldSize = 8,
        ShardingStrategy = ShardingStrategy.FullShard
    })
    .ConfigureGpuAcceleration(new GpuAccelerationConfig
    {
        Enabled = true,
        MixedPrecision = true
    })
    .BuildAsync(trainData, trainLabels);
```

Cloud Training

vast.ai

```csharp
var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = int.Parse(Environment.GetEnvironmentVariable("WORLD_SIZE") ?? "1"),
    Rank = int.Parse(Environment.GetEnvironmentVariable("RANK") ?? "0"),
    MasterAddress = Environment.GetEnvironmentVariable("MASTER_ADDR") ?? "localhost",
    MasterPort = int.Parse(Environment.GetEnvironmentVariable("MASTER_PORT") ?? "29500")
};
```

Azure ML

```yaml
compute:
  instance_type: Standard_NC24ads_A100_v4
  instance_count: 4

distributed:
  type: PyTorch  # Uses NCCL backend
```

AWS SageMaker

```python
estimator = PyTorch(
    entry_point='train.py',
    instance_type='ml.p4d.24xlarge',
    instance_count=4,
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}}
)
```

Checkpointing

DDP Checkpointing

```csharp
// Save (only on rank 0)
if (context.Rank == 0)
{
    model.SaveCheckpoint("checkpoint.pt");
}
context.Barrier();

// Load (all ranks)
model.LoadCheckpoint("checkpoint.pt");
```

FSDP Checkpointing

// Full state dict (gather to rank 0)
var stateDictConfig = new StateDictConfig
{
    Type = StateDictType.FullStateDict,
    Rank0Only = true
};

fsdpModel.SaveCheckpoint("fsdp_checkpoint.pt", stateDictConfig);

// Sharded state dict (each rank saves own shard)
var shardedConfig = new StateDictConfig
{
    Type = StateDictType.ShardedStateDict
};

fsdpModel.SaveCheckpoint("fsdp_sharded/", shardedConfig);

Troubleshooting

NCCL Timeout

```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_TIMEOUT=300000
```

Memory Issues

```csharp
// Reduce batch size
// Enable gradient checkpointing
fsdpConfig.ActivationCheckpointing.Enabled = true;

// Use FULL_SHARD
fsdpConfig.ShardingStrategy = ShardingStrategy.FullShard;
```

Slow Communication

```bash
# Inspect the GPU interconnect topology (NVLink vs PCIe)
nvidia-smi topo -m
```

```bash
# If InfiniBand is misconfigured, fall back to TCP sockets
export NCCL_IB_DISABLE=1
```

Strategy Selection Guide

| Model Size | GPUs | Recommended Strategy |
|------------|------|----------------------|
| < 1B | 1-4 | DDP |
| 1B - 7B | 2-8 | DDP or FSDP ShardGradOp |
| 7B - 13B | 4-8 | FSDP FullShard |
| 13B - 70B | 8-16 | FSDP + ZeRO-3 |
| > 70B | 16+ | FSDP + Pipeline + Tensor |

Best Practices

  1. Start with DDP: Use FSDP only when necessary
  2. Enable mixed precision: FP16/BF16 for faster training
  3. Use gradient accumulation: Increase effective batch size
  4. Enable activation checkpointing: Trade compute for memory
  5. Profile communication: Ensure NCCL is performing well
  6. Use appropriate batch size: Scale with world size
  7. Monitor GPU utilization: Should be > 90%
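Practices 3 and 6 interact: the effective batch size is per-GPU batch × accumulation steps × world size, so when you scale out you usually shrink one of the other factors to keep the optimization dynamics stable. A small illustration (hypothetical helper, not an AiDotNet API):

```python
def effective_batch_size(per_gpu_batch: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Total samples contributing to each optimizer step."""
    return per_gpu_batch * grad_accum_steps * world_size

# 8 GPUs, micro-batch of 4, accumulating over 16 steps
print(effective_batch_size(4, 16, 8))  # 512 samples per optimizer step

# Doubling GPUs while halving accumulation keeps the step size unchanged
assert effective_batch_size(4, 16, 8) == effective_batch_size(4, 8, 16)
```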