Vision-Language Models

165+

Bridge vision and language with state-of-the-art multimodal models

The most comprehensive collection of vision-language models in any .NET library. Instruction-tuned VLMs for image understanding and conversation, vision encoders for feature extraction, plus models for visual grounding, document understanding, and visual question answering.

Image Captioning · Visual Q&A · Document Understanding · Visual Search · Content Moderation · Accessibility · Medical Image Analysis · Retail Product Recognition

Instruction-Tuned VLMs (46 models)

Multimodal chatbots that understand images and follow complex instructions.

LLaVA / LLaVA-NeXT

Visual instruction tuning with strong visual reasoning and conversational ability.

LLaVA-OneVision

Unified vision model for images, videos, and multi-image understanding.

InternVL / InternVL2

Scaled-up vision foundation models with dynamic-resolution input, reaching 108B parameters at the largest scale.

CogVLM / CogVLM2

Deep visual expert modules integrated into the language model's transformer layers.

DeepSeekVL / DeepSeekVL2

Mixture-of-experts VLM with strong real-world understanding.

Qwen-VL / Qwen2-VL

Alibaba multimodal model with bounding box and multi-image support.

Phi-3-Vision / Phi-3.5-Vision

Microsoft compact VLM with strong performance at small scale.

MiniCPM-V

Efficient VLM with OCR capability in 3B parameters.

Idefics2 / Idefics3

HuggingFace open VLM with native multi-image and document support.

Molmo

Allen AI open VLM family with pointing and counting capabilities.

Pixtral

Mistral multimodal model with native image understanding.

Vision Encoders (27 models)

Extract powerful visual features for retrieval, classification, and multimodal alignment.

CLIP / OpenCLIP

Contrastive Language-Image Pre-training for zero-shot visual understanding.

SigLIP / SigLIP-2

Sigmoid loss for language-image pre-training with improved scalability.

DINOv2 / DINOv3

Self-supervised ViT producing strong universal visual features.

Florence-2

Microsoft foundation model for multiple vision tasks from a single backbone.

EVA-CLIP

Improved CLIP with EVA pre-training for better visual representations.

MetaCLIP

CLIP trained with metadata-curated data for improved distribution balance.

BLIP / BLIP-2

Bootstrapping Language-Image Pre-training with Q-Former bridging module.

ALIGN

Google's large-scale model aligning noisy image-text pairs for visual representation learning.

InternViT

Scaling vision transformer to 6B parameters for universal visual encoding.
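Contrastive encoders such as CLIP and SigLIP are typically used by embedding an image and a set of candidate texts into a shared space and ranking candidates by cosine similarity. A minimal zero-shot classification sketch in plain C# (the embedding vectors below are hand-made placeholders standing in for real encoder outputs, not produced by any model):

```csharp
using System;
using System.Linq;

// Cosine similarity between two embedding vectors.
static double Cosine(double[] a, double[] b)
{
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double na = Math.Sqrt(a.Sum(x => x * x));
    double nb = Math.Sqrt(b.Sum(x => x * x));
    return dot / (na * nb);
}

// Placeholder "image embedding" (would come from the vision tower).
double[] image = { 0.9, 0.1, 0.2 };

// Placeholder "text embeddings" for candidate labels (from the text tower).
var labels = new (string Name, double[] Embedding)[]
{
    ("a photo of a cat", new[] { 0.8, 0.2, 0.1 }),
    ("a photo of a dog", new[] { 0.1, 0.9, 0.3 }),
    ("a photo of a car", new[] { 0.2, 0.1, 0.9 }),
};

// Zero-shot prediction: the label whose embedding is closest to the image wins.
var best = labels.OrderByDescending(l => Cosine(image, l.Embedding)).First();
Console.WriteLine(best.Name); // prints "a photo of a cat"
```

The same ranking step powers visual search: embed a query once, then rank a gallery of precomputed image embeddings by cosine similarity.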

Visual Grounding & Detection

Locate objects in images using natural language descriptions.

Grounding DINO

Open-set object detection with text prompts for any object category.

KOSMOS-2

Microsoft multimodal model with grounding and referring capabilities.

GLaMM

Grounding Large Multimodal Model for pixel-level understanding.

Ferret

Apple model for referring and grounding in visual conversations.

CogAgent

Visual agent for GUI understanding and web interaction.
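Open-set grounding models share a common contract: a free-form text prompt goes in, and a list of (phrase, bounding box, confidence) detections comes out. A sketch of that contract and the usual score-threshold post-processing in plain C# (the detections are hand-made placeholders, not real model output):

```csharp
using System;
using System.Linq;

// A text prompt listing the object categories to locate, in the
// period-separated style used by text-prompted detectors.
var prompt = "a red backpack. a water bottle.";

// Placeholder model output for the prompt above: normalized (x, y, w, h)
// boxes with a confidence score and the matched phrase.
var detections = new[]
{
    new Detection("a red backpack", X: 0.10, Y: 0.20, W: 0.30, H: 0.40, Score: 0.92),
    new Detection("a water bottle", X: 0.55, Y: 0.15, W: 0.10, H: 0.25, Score: 0.81),
    new Detection("a water bottle", X: 0.70, Y: 0.60, W: 0.08, H: 0.20, Score: 0.22),
};

// Typical post-processing: keep only detections above a confidence threshold.
var kept = detections.Where(d => d.Score >= 0.35).ToList();

Console.WriteLine($"prompt: {prompt}");
foreach (var d in kept)
    Console.WriteLine($"{d.Phrase}: box=({d.X}, {d.Y}, {d.W}, {d.H}) score={d.Score}");

// C# top-level programs require type declarations after the statements.
record Detection(string Phrase, double X, double Y, double W, double H, double Score);
```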

Document Understanding

Extract structure, text, and semantics from documents, forms, and tables.

LayoutLMv3

Pre-trained model for document AI combining text, layout, and image.

DocFormer

Multimodal transformer for document understanding tasks.

Donut

OCR-free document understanding transformer for end-to-end extraction.

Nougat

Neural Optical Understanding for Academic documents (PDF-to-Markdown).

GOT-OCR

General OCR Theory model for unified optical character recognition.

Vision-language with AiModelBuilder

C#
using AiDotNet;

// Train a vision-language model with AiModelBuilder.
// multimodalLoader is assumed to be a data loader yielding paired
// image-text samples for instruction tuning.
var result = await new AiModelBuilder<float, float[], float>()
    .ConfigureModel(new LLaVA<float>(variant: "v1.6-mistral-7b")) // instruction-tuned VLM
    .ConfigureOptimizer(new AdamOptimizer<float>())
    .ConfigurePreprocessing()
    .ConfigureDataLoader(multimodalLoader)
    .BuildAsync();

// Inference: produce a text description from precomputed image features.
var description = result.Predict(imageFeatures);

Start building with Vision-Language Models

All 165+ implementations are included free under Apache 2.0.