Distributed Strategies

Complete reference for all 10+ distributed training strategies in AiDotNet.



Overview

AiDotNet supports multiple distributed training strategies to scale training across multiple GPUs and nodes:

| Strategy | Memory per GPU | Communication | Best For |
|----------|----------------|---------------|----------|
| DDP | Full model | Gradient sync | Models that fit in GPU memory |
| FSDP | Sharded model | All-gather | Large models |
| ZeRO-1 | Sharded optimizer | Optimizer sync | Medium models |
| ZeRO-2 | + Sharded gradients | Gradient sync | Large models |
| ZeRO-3 | + Sharded params | All-gather | Very large models |
| Pipeline | Split by layers | Forward/backward | Deep models |
| Tensor | Split operations | All-reduce | Wide models |

DDP (Distributed Data Parallel)

Replicates the model on each GPU and synchronizes gradients.

Configuration

```csharp
using AiDotNet.DistributedTraining;

var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = 4  // Number of GPUs
};

using var context = DistributedContext.Initialize(config);

// Wrap the model with DDP so gradients are synchronized across ranks
var ddpModel = DDP.Wrap(model);
```

Training Loop

```csharp
for (int epoch = 0; epoch < epochs; epoch++)
{
    foreach (var batch in dataLoader)
    {
        var output = ddpModel.Forward(batch.Input);
        var loss = lossFunction.Compute(output, batch.Target);

        ddpModel.Backward(loss);
        optimizer.Step();
        optimizer.ZeroGrad();
    }
}
```
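Under the hood, DDP averages each gradient across ranks with an all-reduce before the optimizer step, so every replica applies the same update. A minimal Python sketch of that synchronization, simulating four ranks in-process (no real communication backend, purely illustrative):

```python
# Simulate DDP gradient synchronization: each "rank" computes a
# different local gradient, and an all-reduce replaces every local
# gradient with the average across ranks.

def all_reduce_mean(local_grads):
    """Average a list of per-rank gradients (one flat list per rank)."""
    world_size = len(local_grads)
    n = len(local_grads[0])
    avg = [sum(g[i] for g in local_grads) / world_size for i in range(n)]
    # After the all-reduce, every rank holds the same averaged gradient
    return [avg[:] for _ in range(world_size)]

# Four ranks, each with a different local gradient for 3 parameters
local = [[1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0],
         [0.0, 0.0, 0.0],
         [4.0, 4.0, 4.0]]

synced = all_reduce_mean(local)
```

Because every rank ends up with the identical averaged gradient, identically initialized replicas stay in lockstep after each optimizer step.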

Multi-Node Setup

```bash
# Node 0 (hosts ranks 0-3)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=0
export LOCAL_RANK=0
dotnet run
```

```bash
# Node 1 (hosts ranks 4-7)
export MASTER_ADDR=192.168.1.100
export MASTER_PORT=29500
export WORLD_SIZE=8
export RANK=4
export LOCAL_RANK=0
dotnet run
```

Memory Usage

| Model Size | DDP Memory per GPU |
|------------|--------------------|
| 1B | 4 GB |
| 7B | 28 GB |
| 13B | 52 GB (may OOM) |
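The numbers above are weights-only estimates: fp32 parameters at 4 bytes each, before gradients, optimizer state, and activations are counted. A quick sanity check (a hypothetical helper, not an AiDotNet API):

```python
def ddp_weight_memory_gb(n_params: float) -> float:
    """Weights-only memory for fp32 parameters (4 bytes each).

    DDP replicates the full model, so every GPU pays this cost;
    gradients, optimizer state, and activations come on top.
    """
    return n_params * 4 / 1e9

for n in (1e9, 7e9, 13e9):
    print(f"{n/1e9:.0f}B params -> {ddp_weight_memory_gb(n):.0f} GB per GPU")
```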

FSDP (Fully Sharded Data Parallel)

Shards model parameters across GPUs for memory efficiency.

Configuration

```csharp
using AiDotNet.DistributedTraining.FSDP;

var fsdpConfig = new FSDPConfig<float>
{
    ShardingStrategy = ShardingStrategy.FullShard,
    MixedPrecision = new FSDPMixedPrecisionConfig
    {
        Enabled = true,
        ParameterDtype = DataType.Float32,
        ReduceDtype = DataType.Float32,
        BufferDtype = DataType.BFloat16
    },
    ActivationCheckpointing = new ActivationCheckpointingConfig
    {
        Enabled = true,
        CheckpointInterval = 2
    }
};

var fsdpModel = FSDP<float>.Wrap(model, fsdpConfig);
```

Sharding Strategies

| Strategy | Description | Memory | Speed |
|----------|-------------|--------|-------|
| NoShard | No sharding (like DDP) | High | Fast |
| ShardGradOp | Shard gradients + optimizer | Medium | Medium |
| FullShard | Shard everything | Low | Slower |
| HybridShard | Full shard within node | Balanced | Balanced |

Memory Comparison

| Strategy | 7B Model Memory/GPU (4 GPUs) |
|----------|------------------------------|
| DDP | 28+ GB (OOM) |
| SHARD_GRAD_OP | ~14 GB |
| FULL_SHARD | ~8 GB |
| FULL_SHARD + Checkpointing | ~5 GB |

Wrapping Policies

```csharp
// Auto-wrap transformer layers
var fsdpConfig = new FSDPConfig<float>
{
    AutoWrapPolicy = new TransformerAutoWrapPolicy
    {
        TransformerLayerClass = typeof(TransformerBlock<float>)
    }
};
```

```csharp
// Size-based wrapping
var fsdpConfig = new FSDPConfig<float>
{
    AutoWrapPolicy = new SizeBasedAutoWrapPolicy
    {
        MinNumParams = 100_000_000  // 100M params
    }
};
```

ZeRO Optimization

DeepSpeed-style memory optimization.

ZeRO Stage 1 (Optimizer State Partitioning)

```csharp
using AiDotNet.DistributedTraining.ZeRO;

var zero1 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage1);
```

ZeRO Stage 2 (+ Gradient Partitioning)

```csharp
var zero2 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage2);
```

ZeRO Stage 3 (+ Parameter Partitioning)

```csharp
var zero3 = new ZeROOptimizer<float>(
    baseOptimizer: new AdamOptimizer<float>(),
    stage: ZeROStage.Stage3);
```

Memory Reduction

| Stage | Optimizer States | Gradients | Parameters | Memory Reduction |
|-------|------------------|-----------|------------|------------------|
| Stage 1 | Sharded | Replicated | Replicated | ~4x |
| Stage 2 | Sharded | Sharded | Replicated | ~8x |
| Stage 3 | Sharded | Sharded | Sharded | ~Linear with GPUs |
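These reduction factors follow from the per-parameter byte counts in the ZeRO paper: with mixed-precision Adam, each parameter costs 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, and variance). A sketch of the per-GPU model-state memory at each stage (illustrative arithmetic, not an AiDotNet API):

```python
def zero_memory_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory under ZeRO, in GB.

    Mixed-precision Adam byte counts per parameter:
      2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 master copy,
      momentum, variance) = 16 bytes replicated without ZeRO.
    """
    params = 2 * n_params  # fp16 parameters
    grads = 2 * n_params   # fp16 gradients
    opt = 12 * n_params    # fp32 optimizer states
    if stage == 0:
        total = params + grads + opt            # everything replicated
    elif stage == 1:
        total = params + grads + opt / n_gpus   # shard optimizer states
    elif stage == 2:
        total = params + (grads + opt) / n_gpus # + shard gradients
    else:
        total = (params + grads + opt) / n_gpus # + shard parameters
    return total / 1e9
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120 GB with no sharding, 31 GB at Stage 1, 17 GB at Stage 2, and under 2 GB at Stage 3, matching the ~4x / ~8x / linear reductions in the table.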

Pipeline Parallelism

Splits model across GPUs by layers.

Configuration

```csharp
using AiDotNet.DistributedTraining.Pipeline;

var pipelineConfig = new PipelineConfig
{
    NumStages = 4,
    MicroBatchSize = 4,
    NumMicroBatches = 8
};

var stages = new[]
{
    new PipelineStage(layers: model.Layers[..6], device: 0),
    new PipelineStage(layers: model.Layers[6..12], device: 1),
    new PipelineStage(layers: model.Layers[12..18], device: 2),
    new PipelineStage(layers: model.Layers[18..], device: 3)
};

var pipelineModel = Pipeline.Wrap(model, stages, pipelineConfig);
```

Scheduling

| Schedule | Description | Bubble Overhead |
|----------|-------------|-----------------|
| GPipe | Simple forward-backward | High |
| 1F1B | Interleaved forward/backward | Lower |
| Interleaved 1F1B | Multiple micro-batches | Lowest |
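The bubble overhead can be quantified: with p pipeline stages and m micro-batches, GPipe's idle fraction is (p - 1) / (m + p - 1), so adding micro-batches shrinks the bubble. A quick illustration (hypothetical helper, not part of the library):

```python
def gpipe_bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Fraction of time pipeline stages sit idle under GPipe scheduling."""
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)

# The configuration above: 4 stages, 8 micro-batches
print(gpipe_bubble_fraction(4, 8))   # ~0.27: stages idle about 27% of the time
print(gpipe_bubble_fraction(4, 32))  # ~0.09: more micro-batches shrink the bubble
```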

Tensor Parallelism

Splits individual operations across GPUs.

Configuration

```csharp
using AiDotNet.DistributedTraining.TensorParallel;

var tpConfig = new TensorParallelConfig
{
    WorldSize = 8,
    ParallelMode = ParallelMode.ColumnParallel
};

// Parallel linear layer
var parallelLinear = new ColumnParallelLinear<float>(
    inputDim: 4096,
    outputDim: 16384,
    config: tpConfig);
```

Parallel Modes

| Mode | Splits | Best For |
|------|--------|----------|
| ColumnParallel | Output features | Linear layers |
| RowParallel | Input features | After column parallel |
| SequenceParallel | Sequence dimension | Attention |
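Column parallelism splits the weight matrix by output features: each GPU holds a slice of the columns, computes its partial output, and an all-gather concatenates the slices. A dependency-free Python sketch with two simulated ranks:

```python
def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0]]                # batch of 1, input dim 2
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]      # input dim 2, output dim 4

# Column parallel: rank 0 owns output columns 0-1, rank 1 owns 2-3
w_rank0 = [row[:2] for row in w]
w_rank1 = [row[2:] for row in w]

y_rank0 = matmul(x, w_rank0)    # partial output computed on rank 0
y_rank1 = matmul(x, w_rank1)    # partial output computed on rank 1

# "All-gather": concatenate partial outputs along the feature dimension
y = [y_rank0[0] + y_rank1[0]]
assert y == matmul(x, w)        # identical to the unsharded layer
```

Row parallelism is the mirror image: it splits the input dimension, so each rank produces a partial sum of the full output and an all-reduce adds the partials, which is why it composes naturally after a column-parallel layer.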

Using AiModelBuilder

```csharp
var result = await new AiModelBuilder<float, Tensor<float>, Tensor<float>>()
    .ConfigureModel(largeModel)
    .ConfigureOptimizer(new AdamWOptimizer<float>())
    .ConfigureDistributedTraining(new DistributedConfig
    {
        Strategy = DistributedStrategy.FSDP,
        WorldSize = 8,
        ShardingStrategy = ShardingStrategy.FullShard
    })
    .ConfigureGpuAcceleration(new GpuAccelerationConfig
    {
        Enabled = true,
        MixedPrecision = true
    })
    .BuildAsync(trainData, trainLabels);
```

Cloud Training

vast.ai

```csharp
var config = new DistributedConfig
{
    Backend = DistributedBackend.NCCL,
    WorldSize = int.Parse(Environment.GetEnvironmentVariable("WORLD_SIZE") ?? "1"),
    Rank = int.Parse(Environment.GetEnvironmentVariable("RANK") ?? "0"),
    MasterAddress = Environment.GetEnvironmentVariable("MASTER_ADDR") ?? "localhost",
    MasterPort = int.Parse(Environment.GetEnvironmentVariable("MASTER_PORT") ?? "29500")
};
```

Azure ML

```yaml
compute:
  instance_type: Standard_NC24ads_A100_v4
  instance_count: 4

distributed:
  type: PyTorch  # Uses NCCL backend
```

AWS SageMaker

```python
estimator = PyTorch(
    entry_point='train.py',
    instance_type='ml.p4d.24xlarge',
    instance_count=4,
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}}
)
```

Checkpointing

DDP Checkpointing

```csharp
// Save (only on rank 0)
if (context.Rank == 0)
{
    model.SaveCheckpoint("checkpoint.pt");
}
context.Barrier();

// Load (all ranks)
model.LoadCheckpoint("checkpoint.pt");
```

FSDP Checkpointing

// Full state dict (gather to rank 0)
var stateDictConfig = new StateDictConfig
{
    Type = StateDictType.FullStateDict,
    Rank0Only = true
};

fsdpModel.SaveCheckpoint("fsdp_checkpoint.pt", stateDictConfig);

// Sharded state dict (each rank saves own shard)
var shardedConfig = new StateDictConfig
{
    Type = StateDictType.ShardedStateDict
};

fsdpModel.SaveCheckpoint("fsdp_sharded/", shardedConfig);

Troubleshooting

NCCL Timeout

```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_TIMEOUT=300000
```

Memory Issues

```csharp
// Reduce batch size
// Enable gradient checkpointing
fsdpConfig.ActivationCheckpointing.Enabled = true;

// Use FULL_SHARD
fsdpConfig.ShardingStrategy = ShardingStrategy.FullShard;
```

Slow Communication

```bash
# Inspect the GPU interconnect topology (NVLink vs PCIe)
nvidia-smi topo -m
```

```bash
# If InfiniBand is misconfigured, fall back to TCP sockets
export NCCL_IB_DISABLE=1
```

Strategy Selection Guide

| Model Size | GPUs | Recommended Strategy |
|------------|------|----------------------|
| < 1B | 1-4 | DDP |
| 1B - 7B | 2-8 | DDP or FSDP ShardGradOp |
| 7B - 13B | 4-8 | FSDP FullShard |
| 13B - 70B | 8-16 | FSDP + ZeRO-3 |
| > 70B | 16+ | FSDP + Pipeline + Tensor |

Best Practices

  1. Start with DDP: Use FSDP only when necessary
  2. Enable mixed precision: FP16/BF16 for faster training
  3. Use gradient accumulation: Increase effective batch size
  4. Enable activation checkpointing: Trade compute for memory
  5. Profile communication: Ensure NCCL is performing well
  6. Use appropriate batch size: Scale with world size
  7. Monitor GPU utilization: Should be > 90%
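Practices 3 and 6 interact: the effective batch size is per-GPU batch × accumulation steps × world size, so when you scale out you usually shrink one of the other factors to keep the optimization dynamics stable. A small illustration (hypothetical helper, not an AiDotNet API):

```python
def effective_batch_size(per_gpu_batch: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Total samples contributing to each optimizer step."""
    return per_gpu_batch * grad_accum_steps * world_size

# 8 GPUs, micro-batch of 4, accumulating over 16 steps
print(effective_batch_size(4, 16, 8))  # 512 samples per optimizer step

# Doubling GPUs while halving accumulation keeps the step size unchanged
assert effective_batch_size(4, 16, 8) == effective_batch_size(4, 8, 16)
```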