Question: what do you see in this image?
Outputs: The image shows a referee talking to two soccer players wearing Barcelona jerseys.
The caption reads, "How many penalties would you like today? Umm, two please."
Referee: How many penalties would you like today?
Players: Umm, two please.
In the image, the referee is asking the two soccer players how many penalties they would like to take.
The players respond with "Umm, two please," and the referee agrees to give them two penalties. The caption
underneath the image makes a humorous reference to the number of penalties requested by the players. The
scene appears to be taking place on a soccer field during a game. The players are wearing jerseys for
Barcelona, which is a professional soccer team based in Spain.
VMVLM (Vision-Modulated Vision-Language Models) is a novel architecture that enhances multimodal instruction-following capabilities through dual-stream visual feature integration. By introducing complementary visual pathways that combine traditional query-based visual encoding with direct intermediate visual feature injection, VMVLM achieves superior performance in vision-language understanding tasks. This approach leverages a frozen Vision Transformer (ViT) encoder, a Q-Former for learned visual queries, and a direct vision projection module that feeds intermediate visual representations directly into the Large Language Model (LLM). Experimental results demonstrate significant improvements in visual question answering, image captioning, and multimodal instruction following across multiple benchmarks.
Vision-Language Models (VLMs) have emerged as a critical component in multimodal AI systems, enabling
machines to understand and respond to instructions that involve both visual and textual information. While
existing approaches such as BLIP-2 have shown promising results through query-based visual encoding, they
often suffer from an information bottleneck that limits the richness of the visual representations passed to
the language model.
VMVLM is a novel architecture that addresses these limitations through a dual-stream approach to visual
feature integration. Its key contribution is a Vision Modulation mechanism that provides the LLM with both
compressed visual queries (via the Q-Former) and rich intermediate visual features (via direct projection),
enabling a more nuanced understanding of visual content.
VMVLM consists of three main components:
```mermaid
graph TD
    A[Input Image] --> B["Vision Encoder<br/>EVA-CLIP ViT"]
    B --> C[Final Layer Features]
    B --> D["Intermediate Layer Features"]
    C --> E["Q-Former<br/>Learned Query Tokens"]
    D --> F[Vision Projection]
    G[Text Input] --> E
    E --> H["Q-Former Output<br/>Input for LLM"]
    F --> I["Direct Visual Features<br/>Visual Features for LLM"]
    H --> J["Large Language Model<br/>Vicuna-7B / Flan-T5-XXL"]
    I --> J
    K[Text Embeddings] --> J
    J --> L[Generated Response]
```
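As a rough illustration of how the two visual streams might come together before the LLM, here is a minimal PyTorch-style sketch. The module names (`llm_proj`, `vision_proj`) and the dimensions are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Sketch of VMVLM's dual visual pathways; shapes and names are assumptions."""

    def __init__(self, vit_dim=1408, qformer_dim=768, llm_dim=4096):
        super().__init__()
        # Pathway 1: Q-Former output (compressed visual queries) projected to LLM width
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)
        # Pathway 2: intermediate ViT features projected directly to LLM width
        self.vision_proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, qformer_out, intermediate_feats, text_embeds):
        # qformer_out:        (B, 32, qformer_dim) compressed visual queries
        # intermediate_feats: (B, N,  vit_dim)     mid-layer ViT patch features
        # text_embeds:        (B, T,  llm_dim)     embedded instruction tokens
        query_stream = self.llm_proj(qformer_out)
        direct_stream = self.vision_proj(intermediate_feats)
        # Both visual streams are prepended to the text embeddings as LLM input
        return torch.cat([query_stream, direct_stream, text_embeds], dim=1)
```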
The core innovation of VMVLM lies in its dual-pathway visual processing:
Pathway 1: Query-Based Encoding. The Q-Former compresses final-layer ViT features into 32 learned query tokens, which are projected into the LLM input space.

Pathway 2: Direct Feature Injection. Intermediate ViT features are passed through the vision projection module and fed directly into the LLM, preserving fine-grained visual detail.

VMVLM has been implemented in two primary variants:

VMVLM-Vicuna: built on the autoregressive Vicuna-7B backbone.

VMVLM-FlanT5: built on the encoder-decoder Flan-T5-XXL backbone.

Both variants follow a two-stage training pipeline with an optional LoRA adaptation step, illustrated below:
```mermaid
graph TD
    subgraph "Stage 1 Details"
        D[Image-Text Pairs] --> E["Vision Modulation<br/>Learning"]
        E --> F["Cross-modal<br/>Alignment"]
    end
    subgraph "Stage 2 Details"
        G["Instruction Datasets<br/>VQA, Captioning, etc."] --> H["Task-specific<br/>Adaptation"]
        H --> I["End-to-end<br/>Fine-tuning"]
    end
```

```mermaid
graph LR
    A["Stage 1:<br/>Vision-Language<br/>Pretraining"] --> B["Stage 2:<br/>Instruction<br/>Finetuning"]
    B --> C["Optional:<br/>LoRA Adaptation"]
```
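If the optional LoRA step is used, a typical setup with the Hugging Face `peft` library could look like the sketch below; the `target_modules` choice assumes a Vicuna/LLaMA-style backbone and would differ for Flan-T5.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(llm):
    """Attach LoRA adapters for the optional adaptation step.
    target_modules assumes a Vicuna/LLaMA-style backbone; adjust for Flan-T5."""
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(llm, config)
```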
The model employs loss masking to keep training dynamics stable and prevent interference between input modalities. The masking operates at the token level: it distinguishes instructional content from target responses and handles the two visual pathways separately, so gradients are computed only for response tokens. Because the model is never penalized for visual or instructional tokens, training is more stable and instruction-following capability improves.
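A minimal sketch of this token-level masking, assuming the common convention of setting masked label positions to -100 so that cross-entropy ignores them; the function and argument names are illustrative.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(num_visual_tokens, instruction_ids, response_ids):
    """Mask visual and instruction positions so the loss covers only response tokens.
    A sketch of the selective masking described above; exact details are assumptions."""
    visual_labels = torch.full((num_visual_tokens,), IGNORE_INDEX, dtype=torch.long)
    instruction_labels = torch.full_like(instruction_ids, IGNORE_INDEX)
    # Only response_ids contribute to the cross-entropy loss
    return torch.cat([visual_labels, instruction_labels, response_ids], dim=0)
```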
Stage 1 (Vision-Language Pretraining) trains the Q-Former parameters, the LLM projection layers, and the vision projection module, while the ViT encoder and the LLM backbone remain frozen.
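A minimal PyTorch-style sketch of this freezing scheme; the attribute names (`visual_encoder`, `llm`, `q_former`, `llm_proj`, `vision_proj`) are assumptions, not the released API.

```python
def configure_stage1_trainables(model):
    """Freeze the ViT encoder and LLM backbone; leave Q-Former and projections trainable.
    Attribute names are illustrative assumptions."""
    for p in model.visual_encoder.parameters():
        p.requires_grad = False          # frozen EVA-CLIP ViT
    for p in model.llm.parameters():
        p.requires_grad = False          # frozen LLM backbone
    trainable = []
    for module in (model.q_former, model.llm_proj, model.vision_proj):
        for p in module.parameters():
            p.requires_grad = True       # trainable dual-pathway components
            trainable.append(p)
    return trainable
```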
Stage 2 (Instruction Finetuning) fine-tunes the full model with task-specific prompts on VQA, image captioning, and instruction-following datasets, and supports both generative and discriminative tasks.
```mermaid
sequenceDiagram
    participant U as User
    participant M as VMVLM
    participant V as Vision Encoder
    participant Q as Q-Former
    participant P as Vision Projection
    participant L as LLM
    U->>M: Image + Question
    M->>V: Extract visual features
    V->>Q: Final layer features
    V->>P: Intermediate layer features
    Q->>L: Compressed visual queries
    P->>L: Direct visual features
    M->>L: Text embeddings
    L->>M: Generated response
    M->>U: Final answer
```
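A hypothetical end-to-end inference sketch that mirrors the sequence diagram above; every attribute and method name here is an assumption for illustration, not the repository's actual API.

```python
import torch

def vmvlm_answer(model, image, question):
    """Walk through the inference flow shown in the sequence diagram.
    All attribute and method names are illustrative assumptions."""
    final_feats, mid_feats = model.visual_encoder(image)        # final + intermediate ViT features
    query_tokens = model.q_former(final_feats, text=question)   # compressed visual queries
    visual_stream = model.llm_proj(query_tokens)                # Pathway 1 -> LLM space
    direct_stream = model.vision_proj(mid_feats)                # Pathway 2 -> LLM space
    text_embeds = model.embed_text(question)                    # instruction embeddings
    llm_inputs = torch.cat([visual_stream, direct_stream, text_embeds], dim=1)
    output_ids = model.llm.generate(inputs_embeds=llm_inputs, max_new_tokens=256)
    return model.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```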
| Component | VMVLM-Vicuna | VMVLM-FlanT5 |
|---|---|---|
| Vision Encoder | EVA-CLIP ViT-g | EVA-CLIP ViT-g |
| LLM Backbone | Vicuna-7B | Flan-T5-XXL |
| Query Tokens | 32 | 32 |
| Max Text Length (tokens) | 128 | 128 |
| Max Output Length (tokens) | 256 | 256 |
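For reference, the same configuration expressed as an illustrative Python dictionary; the field names are assumptions rather than the repository's actual config keys.

```python
# Illustrative configuration mirroring the table above; field names are assumptions.
VMVLM_CONFIGS = {
    "vmvlm-vicuna": {
        "vision_encoder": "EVA-CLIP ViT-g",
        "llm_backbone": "Vicuna-7B",
        "num_query_tokens": 32,
        "max_text_len": 128,
        "max_output_len": 256,
    },
    "vmvlm-flant5": {
        "vision_encoder": "EVA-CLIP ViT-g",
        "llm_backbone": "Flan-T5-XXL",
        "num_query_tokens": 32,
        "max_text_len": 128,
        "max_output_len": 256,
    },
}
```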
Unlike single-pathway approaches, VMVLM maintains two complementary streams of visual information, enabling both high-level semantic understanding and fine-grained detail preservation.
By using intermediate ViT layers rather than only final representations, the model captures multi-scale visual features that prove crucial for complex reasoning tasks.
The vision modulation approach is architecture-agnostic and can be applied to both autoregressive (Vicuna) and encoder-decoder (FlanT5) language models.
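Building on the intermediate-layer point above, here is a minimal sketch of capturing mid-layer ViT features with a forward hook; it assumes a timm-style ViT exposing a `.blocks` list, and the choice of `layer_index` is illustrative.

```python
import torch

def extract_vit_features(vit, image, layer_index=-6):
    """Capture intermediate-layer features alongside the final ViT output via a forward hook.
    Assumes a timm-style ViT with a .blocks list; layer_index is an illustrative choice."""
    captured = {}

    def hook(module, inputs, output):
        captured["intermediate"] = output

    handle = vit.blocks[layer_index].register_forward_hook(hook)
    with torch.no_grad():
        final_feats = vit(image)
    handle.remove()
    return final_feats, captured["intermediate"]
```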
VMVLM represents a significant advance in compact, efficient vision-language modeling through its dual-stream visual processing architecture. By combining the benefits of learned visual queries with direct intermediate feature injection, the model achieves superior performance across diverse multimodal tasks while maintaining computational efficiency. The architecture's flexibility and strong empirical results demonstrate its potential as a foundation for future multimodal AI systems.
@article{vmvlm,
  title={Vision-Modulated Vision-Language Models for Improved Instruction Following},
  author={Taneem Ullah Jan},
  year={2025}
}