Audio, Speech & TTS

250+ implementations

The most comprehensive audio AI pipeline in any .NET library

AiDotNet covers the entire audio AI pipeline in pure C#: speech recognition, text-to-speech, music generation, audio classification, voice conversion, and more. The catalogue ranges from Whisper speech recognition and VITS text-to-speech to MusicGen music generation and RVC voice conversion.

Voice Assistants, Transcription Services, Audiobook Generation, Music Production, Podcast Enhancement, Accessibility Tools, Call Center Analytics, Content Localization

Speech Recognition (ASR)

Convert speech to text with state-of-the-art accuracy across 100+ languages.

Whisper v3

OpenAI Whisper with 100+ language support and robust accuracy.

wav2vec 2.0

Self-supervised speech representation learning from raw audio.

HuBERT

Hidden-Unit BERT with offline clustering for speech representations.

WavLM

Large-scale self-supervised speech model with denoising pretraining.

Conformer

Convolution-augmented transformer for accurate ASR.

Canary

NVIDIA multilingual ASR built on the FastConformer architecture.

USM

Google Universal Speech Model for 300+ languages.

SeamlessM4T

Meta multimodal translation model for speech and text across nearly 100 languages.

Text-to-Speech (TTS)

Generate natural-sounding speech from text with control over voice and emotion.

VITS / VITS2

End-to-end TTS with variational inference and adversarial training.

Tacotron 2

Sequence-to-sequence TTS with location-sensitive attention.

FastSpeech 2

Non-autoregressive TTS with duration, pitch, and energy control.

NaturalSpeech

Human-level-quality TTS built on a variational autoencoder with a flow-based prior.

StyleTTS 2

Style-controllable TTS with diffusion and adversarial training.

XTTS

Cross-lingual TTS with voice cloning from a 6-second reference clip.

Bark

Transformer-based text-to-audio model that generates multilingual speech, music, and sound effects.

Tortoise TTS

High-quality multi-voice TTS with CLVP and diffusion decoder.

Piper

Fast, lightweight TTS optimized for edge and offline use.

MetaVoice

Foundation model for human-like, emotional speech generation.

Voice Conversion & Cloning

Transform voice characteristics while preserving content and emotion.

RVC

Retrieval-based Voice Conversion with high quality and real-time speed.

So-VITS-SVC

Singing voice conversion with VITS backbone.

OpenVoice

Instant voice cloning with fine-grained style control.

VALL-E

Neural codec language model for zero-shot voice cloning.

VALL-E X

Cross-lingual speech synthesis with zero-shot voice cloning.

FreeVC

Text-free voice conversion without parallel data.

Music & Sound Generation

Generate music, sound effects, and audio from text descriptions.

MusicGen

Meta text-to-music generation with melody conditioning.

AudioCraft

Meta audio generation suite for music, sound effects, and compression.

Stable Audio

Stability AI diffusion-based audio and music generation.

AudioLDM / AudioLDM2

Latent diffusion for text-to-audio and text-to-music.

Riffusion

Real-time music generation via spectrogram diffusion.

MAGNeT

Non-autoregressive music generation with masked modeling.

Audio Classification & Analysis

Classify, tag, and analyze audio content including environmental sounds and music.

AST

Audio Spectrogram Transformer for audio classification.

BEATs

Audio pre-training with acoustic tokenizers for general audio understanding.

CLAP

Contrastive Language-Audio Pretraining for text-audio matching.

PANNs

Pre-trained Audio Neural Networks for large-scale audio pattern recognition.

Speech recognition with AiModelBuilder

C#
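using AiDotNet;

// Configure and build a Whisper-based speech recognition model with AiModelBuilder.
// Note: audioLoader (the training data loader) and audioSample (the audio to
// transcribe) are not defined in this snippet and must be supplied by the caller.
var result = await new AiModelBuilder<float, float[], float>()
    .ConfigureModel(new Whisper<float>(variant: "large-v3"))
    .ConfigureOptimizer(new AdamOptimizer<float>())
    .ConfigurePreprocessing()
    .ConfigureDataLoader(audioLoader)
    .BuildAsync();

// Run inference on a single audio sample.
var transcript = result.Predict(audioSample);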

Start building with Audio, Speech & TTS

All 250+ implementations are included free under Apache 2.0.