Vision-Language Models
165+ models
Bridge vision and language with state-of-the-art multimodal models.
The most comprehensive collection of vision-language models in any .NET library. Instruction-tuned VLMs for image understanding and conversation, vision encoders for feature extraction, plus grounding models, document understanding, and visual question answering.
Instruction-Tuned VLMs (46 models)
Multimodal chatbots that understand images and follow complex instructions.
LLaVA / LLaVA-NeXT
Visual instruction tuning with strong visual reasoning and conversation.
LLaVA-OneVision
Unified vision model for images, videos, and multi-image understanding.
InternVL / InternVL2
Vision foundation model scaled up with dynamic-resolution input, with variants up to 108B parameters.
CogVLM / CogVLM2
Deep visual-expert integration inside the language model's transformer layers.
DeepSeekVL / DeepSeekVL2
Mixture-of-experts VLM with strong real-world understanding.
Qwen-VL / Qwen2-VL
Alibaba multimodal model with bounding box and multi-image support.
Phi-3-Vision / Phi-3.5-Vision
Microsoft compact VLM with strong performance at small scale.
MiniCPM-V
Efficient VLM with OCR capability in 3B parameters.
Idefics2 / Idefics3
HuggingFace open VLM with native multi-image and document support.
Molmo
Allen AI open VLM family with pointing and counting capabilities.
Pixtral
Mistral multimodal model with native image understanding.
Vision Encoders (27 models)
Extract powerful visual features for retrieval, classification, and multimodal alignment.
CLIP / OpenCLIP
Contrastive Language-Image Pre-training for zero-shot visual understanding.
SigLIP / SigLIP-2
Sigmoid loss for language-image pre-training with improved scalability.
DINOv2 / DINOv3
Self-supervised ViT producing strong universal visual features.
Florence-2
Microsoft foundation model for multiple vision tasks from a single backbone.
EVA-CLIP
Improved CLIP with EVA pre-training for better visual representations.
MetaCLIP
CLIP trained with metadata-curated data for improved distribution balance.
BLIP / BLIP-2
Bootstrapping Language-Image Pre-training with Q-Former bridging module.
ALIGN
Google's large-scale contrastive alignment of noisy image-text pairs for visual representations.
InternViT
Scaling vision transformer to 6B parameters for universal visual encoding.
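As a sketch of how a vision encoder from this section might drive zero-shot classification: the ClipModel class, EncodeImage/EncodeText methods, and the local CosineSimilarity helper below are illustrative assumptions about the API, not confirmed AiDotNet signatures.

```csharp
using System;
using System.Linq;
using AiDotNet;

// Hypothetical sketch: zero-shot classification with a CLIP-style encoder.
// ClipModel, EncodeImage, and EncodeText are assumed names for illustration.
var clip = new ClipModel<float>(variant: "ViT-B-32");

// Embed the image and each candidate caption into the shared space.
float[] imageEmb = clip.EncodeImage(imagePixels);
string[] labels = { "a photo of a cat", "a photo of a dog" };
float[] scores = labels
    .Select(label => CosineSimilarity(imageEmb, clip.EncodeText(label)))
    .ToArray();

// The highest-scoring caption is the zero-shot prediction.
int best = Array.IndexOf(scores, scores.Max());
Console.WriteLine(labels[best]);

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}
```

The same embedding-plus-similarity pattern also powers image retrieval: precompute text embeddings once, then rank images by cosine similarity.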
Visual Grounding & Detection
Locate objects in images using natural language descriptions.
Grounding DINO
Open-set object detection with text prompts for any object category.
KOSMOS-2
Microsoft multimodal model with grounding and referring capabilities.
GLaMM
Grounding Large Multimodal Model for pixel-level understanding.
Ferret
Apple model for referring and grounding in visual conversations.
CogAgent
Visual agent for GUI understanding and web interaction.
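A minimal sketch of text-prompted detection with a Grounding-DINO-style model; the GroundingDino class, DetectAsync method, and detection record shape are assumptions for illustration, not confirmed AiDotNet API.

```csharp
using System;
using AiDotNet;

// Hypothetical sketch: open-set detection driven by a free-form text prompt.
// GroundingDino and DetectAsync are assumed names, not confirmed signatures.
var detector = new GroundingDino<float>();

// The prompt selects which objects to localize; no fixed category list.
var detections = await detector.DetectAsync(imagePixels, prompt: "red bicycle");

foreach (var d in detections)
{
    // Each detection pairs a label with a bounding box and a confidence score.
    Console.WriteLine($"{d.Label}: {d.Box} ({d.Score:P0})");
}
```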
Document Understanding
Extract structure, text, and semantics from documents, forms, and tables.
LayoutLMv3
Pre-trained model for document AI combining text, layout, and image.
DocFormer
Multi-modal transformer for document understanding tasks.
Donut
OCR-free document understanding transformer for end-to-end extraction.
Nougat
Neural Optical Understanding for Academic documents (PDF-to-Markdown).
GOT-OCR
General OCR Theory model for unified optical character recognition.
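To sketch the OCR-free approach these document models take, here is an assumed usage of a Donut-style model; the DonutModel class, ParseAsync method, and prompt format are illustrative assumptions, not confirmed AiDotNet API.

```csharp
using System;
using AiDotNet;

// Hypothetical sketch: OCR-free document question answering.
// DonutModel and ParseAsync are assumed names for illustration.
var donut = new DonutModel<float>(variant: "base-finetuned-docvqa");

// The model maps page pixels directly to structured text,
// skipping a separate OCR stage entirely.
string answer = await donut.ParseAsync(
    pagePixels,
    prompt: "What is the invoice total?");
Console.WriteLine(answer);
```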
Vision-language with AiModelBuilder
using AiDotNet;

// Train a vision-language model with AiModelBuilder.
// multimodalLoader is assumed to be defined elsewhere and to yield
// paired image features and text targets.
var result = await new AiModelBuilder<float, float[], float>()
    .ConfigureModel(new LLaVA<float>(variant: "v1.6-mistral-7b"))
    .ConfigureOptimizer(new AdamOptimizer<float>())
    .ConfigurePreprocessing()
    .ConfigureDataLoader(multimodalLoader)
    .BuildAsync();

// Run inference: generate a description from precomputed image features.
var description = result.Predict(imageFeatures);

Start building with Vision-Language Models
All 165+ implementations are included free under the Apache 2.0 license.