Audio, Speech & TTS
250+ implementations
The most comprehensive audio AI pipeline in any .NET library
From Whisper speech recognition to VITS text-to-speech, and from MusicGen music generation to RVC voice conversion, AiDotNet covers the entire audio AI pipeline in pure C#: speech recognition, text-to-speech, music generation, audio classification, voice conversion, and more.
Speech Recognition (ASR)
Convert speech to text with state-of-the-art accuracy across 100+ languages.
Whisper v3
OpenAI Whisper with 100+ language support and robust accuracy.
wav2vec 2.0
Self-supervised speech representation learning from raw audio.
HuBERT
Hidden-Unit BERT with offline clustering for speech representations.
WavLM
Large-scale self-supervised speech model with denoising pretraining.
Conformer
Convolution-augmented transformer for accurate ASR.
Canary
NVIDIA multi-language ASR with fast-conformer architecture.
USM
Google Universal Speech Model covering 300+ languages.
SeamlessM4T
Meta translation model for speech and text across 100 languages.
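Accuracy for the recognition models above is conventionally reported as word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal, library-independent sketch of that metric. AsrMetrics and WordErrorRate are hypothetical names chosen for illustration and are not part of the AiDotNet API.

```csharp
using System;

// Toy example: one substitution ("sat" -> "sit") and one deletion (the second "the"),
// so 2 edits against 6 reference words.
double wer = AsrMetrics.WordErrorRate("the cat sat on the mat", "the cat sit on mat");
Console.WriteLine(wer);

// Hypothetical helper, not an AiDotNet type.
public static class AsrMetrics
{
    // WER = (substitutions + deletions + insertions) / reference word count.
    public static double WordErrorRate(string reference, string hypothesis)
    {
        string[] r = Tokenize(reference);
        string[] h = Tokenize(hypothesis);
        if (r.Length == 0)
            throw new ArgumentException("Reference must contain at least one word.", nameof(reference));

        // d[i, j] = edit distance between the first i reference words
        // and the first j hypothesis words (standard Levenshtein table).
        int[,] d = new int[r.Length + 1, h.Length + 1];
        for (int i = 0; i <= r.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= h.Length; j++) d[0, j] = j;

        for (int i = 1; i <= r.Length; i++)
        {
            for (int j = 1; j <= h.Length; j++)
            {
                int substitution = d[i - 1, j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
                int deletion = d[i - 1, j] + 1;
                int insertion = d[i, j - 1] + 1;
                d[i, j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
        }

        return (double)d[r.Length, h.Length] / r.Length;
    }

    // Lowercase and split on spaces; real evaluations usually also strip punctuation.
    private static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.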
Text-to-Speech (TTS)
Generate natural-sounding speech from text with voice control and emotion.
VITS / VITS2
End-to-end TTS with variational inference and adversarial training.
Tacotron 2
Sequence-to-sequence TTS with location-sensitive attention.
FastSpeech 2
Non-autoregressive TTS with duration, pitch, and energy control.
NaturalSpeech
Human-level TTS with variational autoencoder and flow matching.
StyleTTS 2
Style-controllable TTS with diffusion and adversarial training.
XTTS
Cross-lingual TTS with voice cloning from a 6-second reference clip.
Bark
Transformer-based TTS with music, sound effects, and multilingual support.
Tortoise TTS
High-quality multi-voice TTS with CLVP and diffusion decoder.
Piper
Fast, lightweight TTS optimized for edge and offline use.
MetaVoice
Foundation model for human-like, emotional speech generation.
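Duration control in non-autoregressive models such as FastSpeech 2 rests on a length regulator: each phoneme representation is repeated for as many output frames as its duration, and scaling the durations changes the speaking rate. The sketch below illustrates only that expansion step, using phoneme labels in place of hidden vectors. LengthRegulator, Expand, and the speed parameter are hypothetical names for illustration, not AiDotNet API, and the durations are made-up values rather than model predictions.

```csharp
using System;
using System.Collections.Generic;

// Toy input: four phonemes with hand-picked frame counts (not model output).
string[] phonemes = { "HH", "AH", "L", "OW" };
int[] durations = { 2, 3, 1, 4 };

List<string> frames = LengthRegulator.Expand(phonemes, durations, speed: 1.0);
Console.WriteLine(string.Join(" ", frames));

// Hypothetical helper, not an AiDotNet type.
public static class LengthRegulator
{
    // Repeats items[i] round(durations[i] / speed) times.
    // speed > 1 shortens the output (faster speech); speed < 1 lengthens it.
    public static List<T> Expand<T>(T[] items, int[] durations, double speed = 1.0)
    {
        if (items.Length != durations.Length)
            throw new ArgumentException("Each item needs exactly one duration.");
        if (speed <= 0)
            throw new ArgumentOutOfRangeException(nameof(speed), "Speed must be positive.");

        var output = new List<T>();
        for (int i = 0; i < items.Length; i++)
        {
            int repeats = (int)Math.Round(durations[i] / speed, MidpointRounding.AwayFromZero);
            for (int k = 0; k < repeats; k++)
                output.Add(items[i]);
        }
        return output;
    }
}
```

Because the output length is fixed by the durations up front, all frames can then be decoded in parallel, which is what makes this family of models non-autoregressive.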
Voice Conversion & Cloning
Transform voice characteristics while preserving content and emotion.
RVC
Retrieval-based Voice Conversion with high quality and real-time speed.
So-VITS-SVC
Singing voice conversion with VITS backbone.
OpenVoice
Instant voice cloning with fine-grained style control.
VALL-E
Neural codec language model for zero-shot voice cloning.
VALL-E X
Cross-lingual speech synthesis with zero-shot voice cloning.
FreeVC
Text-free voice conversion without parallel data.
Music & Sound Generation
Generate music, sound effects, and audio from text descriptions.
MusicGen
Meta text-to-music generation with melody conditioning.
AudioCraft
Meta audio generation suite for music, sound effects, and compression.
Stable Audio
Stability AI diffusion-based audio and music generation.
AudioLDM / AudioLDM2
Latent diffusion for text-to-audio and text-to-music.
Riffusion
Real-time music generation via spectrogram diffusion.
MAGNeT
Non-autoregressive music generation with masked modeling.
Audio Classification & Analysis
Classify, tag, and analyze audio content including environmental sounds and music.
AST
Audio Spectrogram Transformer for audio classification.
BEATs
Audio pre-training with acoustic tokenizers for general audio understanding.
CLAP
Contrastive Language-Audio Pretraining for text-audio matching.
PANNs
Pre-trained Audio Neural Networks for large-scale audio pattern recognition.
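Text-audio matching with a contrastively trained model such as CLAP comes down to embedding the audio clip and each candidate caption in a shared space and ranking the captions by cosine similarity. The sketch below shows only that ranking step on hand-written toy vectors. EmbeddingMatcher and its methods are hypothetical names, not AiDotNet API, and a real pipeline would obtain the embeddings from the model's audio and text encoders.

```csharp
using System;

// Toy 3-dimensional embeddings, made up for illustration only.
float[] audioEmbedding = { 0.9f, 0.1f, 0.0f };
string[] captions = { "dog barking", "rain" };
float[][] captionEmbeddings =
{
    new[] { 1f, 0f, 0f },
    new[] { 0f, 1f, 0f },
};

int best = EmbeddingMatcher.BestMatch(audioEmbedding, captionEmbeddings);
Console.WriteLine(captions[best]); // prints "dog barking"

// Hypothetical helper, not an AiDotNet type.
public static class EmbeddingMatcher
{
    // Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 if either vector is all zeros.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Index of the candidate most similar to the query.
    public static int BestMatch(float[] query, float[][] candidates)
    {
        if (candidates.Length == 0)
            throw new ArgumentException("At least one candidate is required.");

        int bestIndex = 0;
        double bestScore = double.NegativeInfinity;
        for (int i = 0; i < candidates.Length; i++)
        {
            double score = CosineSimilarity(query, candidates[i]);
            if (score > bestScore)
            {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

The same ranking gives zero-shot classification: write one caption per class, embed them once, and pick the best match for each incoming clip.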
Speech recognition with AiModelBuilder
using AiDotNet;

// Train a speech recognition model with AiModelBuilder.
// audioLoader (the training data loader) and audioSample (the audio to transcribe)
// are assumed to be defined elsewhere.
var result = await new AiModelBuilder<float, float[], float>()
    .ConfigureModel(new Whisper<float>(variant: "large-v3"))
    .ConfigureOptimizer(new AdamOptimizer<float>())
    .ConfigurePreprocessing()
    .ConfigureDataLoader(audioLoader)
    .BuildAsync();

// Run inference on a single audio sample.
var transcript = result.Predict(audioSample);

Start building with Audio, Speech & TTS
All 250+ implementations are included free under Apache 2.0.