Audio Processing

Learn to process audio with AiDotNet’s speech and audio APIs.



What is Audio Processing?

Audio processing with machine learning includes:

  • Speech-to-Text (transcription)
  • Text-to-Speech (synthesis)
  • Speaker recognition and diarization
  • Audio classification
  • Music generation and analysis

AiDotNet Audio Capabilities

Feature         Description                 Model
Transcription   Convert speech to text      Whisper
TTS             Convert text to speech      VITS, Tacotron2
Speaker ID      Identify who is speaking    ECAPA-TDNN
Diarization     Segment by speaker          SpeakerDiarizer
Classification  Classify audio events       PANNs, AST
Enhancement     Remove noise                DeepFilterNet, DCCRN

Quick Start - Speech Transcription

using AiDotNet.Audio;
using AiDotNet.Audio.Whisper;

// Load Whisper model
var whisper = new WhisperModel<float>(new WhisperOptions
{
    ModelSize = WhisperModelSize.Base,
    Language = "en"
});

// Transcribe audio file
var result = await whisper.TranscribeAsync("speech.wav");

Console.WriteLine($"Transcription: {result.Text}");

// Access segments with timestamps
foreach (var segment in result.Segments)
{
    Console.WriteLine($"[{segment.Start:F2}s - {segment.End:F2}s] {segment.Text}");
}

Speech Recognition (Whisper)

Model Sizes

Size     Parameters  English    Multilingual  Speed
Tiny     39M         Good       Fair          Fastest
Base     74M         Better     Good          Fast
Small    244M        Great      Great         Medium
Medium   769M        Excellent  Excellent     Slow
Large    1.5B        Best       Best          Slowest

Basic Transcription

var whisper = new WhisperModel<float>(new WhisperOptions
{
    ModelSize = WhisperModelSize.Small,
    Device = DeviceType.GPU  // Use GPU acceleration
});

// From file
var resultFromFile = await whisper.TranscribeAsync("audio.wav");

// From stream
using var stream = File.OpenRead("audio.wav");
var resultFromStream = await whisper.TranscribeAsync(stream);

// From byte array
var audioData = await File.ReadAllBytesAsync("audio.wav");
var resultFromBytes = await whisper.TranscribeAsync(audioData);

Advanced Options

var result = await whisper.TranscribeAsync("audio.wav", new TranscriptionOptions
{
    Language = "en",              // Force language (auto-detect if null)
    Task = WhisperTask.Transcribe,  // or WhisperTask.Translate
    WordTimestamps = true,        // Get word-level timestamps
    BeamSize = 5,                 // Beam search width
    Temperature = 0.0f,           // Sampling temperature (0 = greedy)
    VadFilter = true,             // Voice activity detection
    InitialPrompt = "Technical meeting about AI"  // Context hint
});

// Access word-level timestamps
foreach (var word in result.Words)
{
    Console.WriteLine($"{word.Word} [{word.Start:F2}s - {word.End:F2}s] (conf: {word.Probability:P0})");
}

Real-time Streaming

// Real-time transcription from microphone
var whisper = new WhisperModel<float>(new WhisperOptions
{
    ModelSize = WhisperModelSize.Base,
    StreamingMode = true
});

// microphoneStream is assumed here: any Stream supplying live PCM audio from your capture device
await foreach (var segment in whisper.TranscribeStreamAsync(microphoneStream))
{
    Console.Write(segment.Text);  // Stream output as it's transcribed
}

Text-to-Speech (TTS)

using AiDotNet.Audio.TextToSpeech;

var tts = new VITSModel<float>(new TtsOptions
{
    Language = "en",
    SpeakerId = 0  // Multi-speaker models support different voices
});

// Generate speech
var audio = await tts.SynthesizeAsync("Hello, welcome to AiDotNet!");

// Save to file
await audio.SaveAsync("output.wav");

// Or get raw samples
var samples = audio.GetSamples();

Tacotron2 Model

var tts = new Tacotron2Model<float>(new TtsOptions
{
    VocoderType = VocoderType.HiFiGAN,
    SpeakingRate = 1.0f,
    Pitch = 1.0f
});

var audio = await tts.SynthesizeAsync("Converting text to natural speech.");

Adjusting Voice

// Speed up/slow down
audio = await tts.SynthesizeAsync("Fast speech", speakingRate: 1.5f);
audio = await tts.SynthesizeAsync("Slow speech", speakingRate: 0.75f);

// Adjust pitch
audio = await tts.SynthesizeAsync("Higher pitch", pitch: 1.2f);
audio = await tts.SynthesizeAsync("Lower pitch", pitch: 0.8f);

// Add pauses with SSML
var ssml = "<speak>Hello. <break time='500ms'/> How are you?</speak>";
audio = await tts.SynthesizeSsmlAsync(ssml);

Speaker Recognition

Speaker Verification

Verify whether two audio samples come from the same speaker.

using AiDotNet.Audio.Speaker;

var verifier = new SpeakerVerifier<float>(new SpeakerVerifierOptions
{
    Threshold = 0.5f  // Similarity threshold
});

// Compare two audio samples
var result = await verifier.VerifyAsync("sample1.wav", "sample2.wav");

Console.WriteLine($"Same speaker: {result.IsSameSpeaker}");
Console.WriteLine($"Similarity: {result.Similarity:F4}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

Speaker Identification

Identify a speaker from a known set.

var identifier = new SpeakerEmbeddingExtractor<float>();

// Enroll speakers
await identifier.EnrollAsync("alice", "alice_sample1.wav");
await identifier.EnrollAsync("alice", "alice_sample2.wav");  // Multiple samples improve accuracy
await identifier.EnrollAsync("bob", "bob_sample1.wav");
await identifier.EnrollAsync("charlie", "charlie_sample1.wav");

// Identify unknown speaker
var result = await identifier.IdentifyAsync("unknown.wav");

Console.WriteLine($"Identified: {result.SpeakerId}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

// Get rankings
foreach (var match in result.Matches.OrderByDescending(m => m.Score).Take(3))
{
    Console.WriteLine($"  {match.SpeakerId}: {match.Score:F4}");
}

Speaker Diarization

Segment audio by speaker (who spoke when).

var diarizer = new SpeakerDiarizer<float>(new SpeakerDiarizerOptions
{
    MinSpeakers = 2,
    MaxSpeakers = 10,
    MinSegmentDuration = 0.5  // Minimum segment length in seconds
});

var result = await diarizer.DiarizeAsync("meeting.wav");

Console.WriteLine($"Detected {result.NumSpeakers} speakers");
Console.WriteLine();

foreach (var turn in result.Turns)
{
    Console.WriteLine($"[{turn.Start:F2}s - {turn.End:F2}s] Speaker {turn.SpeakerId}");
}

Audio Classification

Event Detection

using AiDotNet.Audio.Classification;

var detector = new AudioEventDetector<float>(new AudioEventDetectorOptions
{
    MinConfidence = 0.5f
});

var events = await detector.DetectAsync("audio.wav");

foreach (var evt in events)
{
    Console.WriteLine($"[{evt.Start:F2}s] {evt.Label} (confidence: {evt.Confidence:P0})");
}
// Example output:
// [0.00s] Speech (confidence: 95%)
// [3.50s] Dog bark (confidence: 87%)
// [5.20s] Door slam (confidence: 72%)

Music Genre Classification

var classifier = new GenreClassifier<float>();

var result = await classifier.ClassifyAsync("song.mp3");

Console.WriteLine($"Genre: {result.PredictedGenre}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

// Get all predictions
foreach (var (genre, prob) in result.AllProbabilities.OrderByDescending(p => p.Value).Take(5))
{
    Console.WriteLine($"  {genre}: {prob:P1}");
}

Scene Classification

var sceneClassifier = new SceneClassifier<float>();

var result = await sceneClassifier.ClassifyAsync("recording.wav");

Console.WriteLine($"Scene: {result.PredictedScene}");  // e.g., "office", "street", "park"

Audio Enhancement

Noise Reduction

using AiDotNet.Audio.Enhancement;

var denoiser = new DeepFilterNet<float>(new DeepFilterNetOptions
{
    AttenLimit = 100,  // Maximum noise attenuation in dB
    PostFilter = true
});

var cleanAudio = await denoiser.EnhanceAsync("noisy_speech.wav");
await cleanAudio.SaveAsync("clean_speech.wav");

Speech Enhancement

var enhancer = new DCCRN<float>();

var enhanced = await enhancer.EnhanceAsync("poor_quality.wav");
await enhanced.SaveAsync("enhanced.wav");

Audio Feature Extraction

Mel-Frequency Cepstral Coefficients (MFCCs)

using AiDotNet.Audio.Features;

var mfccExtractor = new MfccExtractor<float>(new MfccOptions
{
    NumCoefficients = 13,
    SampleRate = 16000,
    FrameLength = 0.025f,  // 25ms
    FrameStep = 0.010f     // 10ms
});

var mfccs = mfccExtractor.Extract("audio.wav");
// Shape: [numFrames, numCoefficients]
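As a sanity check on that shape: with a 25 ms frame and a 10 ms step, one second of 16 kHz audio produces 98 frames. The calculation below is a standalone sketch of the standard short-time framing formula (no AiDotNet APIs involved):

```csharp
// numFrames = 1 + floor((numSamples - frameLength) / frameStep)
int sampleRate = 16000;
int numSamples = sampleRate * 1;            // 1 second of audio
int frameLength = sampleRate * 25 / 1000;   // 400 samples (25 ms)
int frameStep = sampleRate * 10 / 1000;     // 160 samples (10 ms)

int numFrames = 1 + (numSamples - frameLength) / frameStep;
Console.WriteLine(numFrames);  // 98
```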

Spectrogram

var spectralExtractor = new SpectralFeatureExtractor<float>(new SpectralFeatureOptions
{
    FeatureType = SpectralFeatureType.MelSpectrogram,
    NumMelBins = 80,
    WindowType = WindowType.Hann
});

var spectrogram = spectralExtractor.Extract("audio.wav");

Chroma Features

var chromaExtractor = new ChromaExtractor<float>(new ChromaOptions
{
    NumChroma = 12,
    Tuning = 440.0f  // A4 = 440 Hz
});

var chroma = chromaExtractor.Extract("music.wav");

Music Analysis

Beat Tracking

using AiDotNet.Audio.MusicAnalysis;

var beatTracker = new BeatTracker<float>();

var result = await beatTracker.TrackAsync("song.mp3");

Console.WriteLine($"BPM: {result.Tempo:F1}");

foreach (var beat in result.Beats)
{
    Console.WriteLine($"Beat at {beat:F3}s");
}

Key Detection

var keyDetector = new KeyDetector<float>();

var result = await keyDetector.DetectAsync("song.mp3");

Console.WriteLine($"Key: {result.Key} {result.Mode}");  // e.g., "C Major"
Console.WriteLine($"Confidence: {result.Confidence:P0}");

Chord Recognition

var chordRecognizer = new ChordRecognizer<float>();

var result = await chordRecognizer.RecognizeAsync("song.mp3");

foreach (var segment in result.Segments)
{
    Console.WriteLine($"[{segment.Start:F2}s - {segment.End:F2}s] {segment.Chord}");
}

Audio Formats and I/O

Supported Formats

  • WAV (PCM, float)
  • MP3
  • FLAC
  • OGG
  • M4A/AAC

Loading Audio

// From file
var audioFromFile = AudioData.Load("input.wav");

// From URL
var audioFromUrl = await AudioData.LoadFromUrlAsync("https://example.com/audio.wav");

// From stream
using var stream = File.OpenRead("input.wav");
var audioFromStream = AudioData.Load(stream);

Resampling

// Resample to 16kHz (common for speech models)
var resampled = audio.Resample(targetSampleRate: 16000);

// Convert to mono
var mono = audio.ToMono();

// Normalize volume
var normalized = audio.Normalize();

Best Practices

  1. Sample Rate: Most speech models expect 16kHz audio
  2. Mono Audio: Convert stereo to mono for most speech tasks
  3. Preprocessing: Normalize audio and remove silence for better results
  4. GPU Acceleration: Use GPU for large files or real-time processing
  5. Batching: Process multiple files in batches for efficiency
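The batching practice can be sketched with the TranscribeAsync API shown above. The SemaphoreSlim concurrency cap of 4 is an arbitrary assumption, not an AiDotNet setting; tune it to your hardware so concurrent transcriptions do not exhaust GPU memory:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using AiDotNet.Audio.Whisper;

var whisper = new WhisperModel<float>(new WhisperOptions
{
    ModelSize = WhisperModelSize.Base
});

var files = new[] { "a.wav", "b.wav", "c.wav" };

// Cap concurrent transcriptions; 4 is an arbitrary starting point
using var gate = new SemaphoreSlim(4);

var tasks = files.Select(async file =>
{
    await gate.WaitAsync();
    try
    {
        return (file, result: await whisper.TranscribeAsync(file));
    }
    finally
    {
        gate.Release();
    }
});

foreach (var (file, result) in await Task.WhenAll(tasks))
{
    Console.WriteLine($"{file}: {result.Text}");
}
```

Sharing one model instance across tasks, as above, avoids paying the model-load cost per file; if the model is not thread-safe, lower the cap to 1 or create one instance per worker.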

Common Issues

Poor Transcription Quality

  • Ensure audio is clear with minimal background noise
  • Use larger Whisper models for difficult audio
  • Consider preprocessing with noise reduction

Slow Processing

  • Use smaller models for real-time applications
  • Enable GPU acceleration
  • Use streaming mode for long audio files

Memory Issues

  • Process long audio in chunks
  • Use streaming APIs
  • Reduce model size if memory-constrained

Next Steps