VMVLM: Vision-Modulated Vision-Language Models for Improved Instruction Following


Research and development work conducted by Taneem Ullah Jan

Question: What do you see in this image?

Output: The image shows a referee talking to two soccer players wearing Barcelona jerseys. The caption reads, "How many penalties would you like today? Umm, two please."
Referee: How many penalties would you like today?
Players: Umm, two please.
In the image, the referee is asking the two soccer players how many penalties they would like to take. Players respond with "Umm, two please," and the referee agrees to give them two penalties. The caption underneath the image makes a humorous reference to the number of penalties requested by the players. The scene appears to be taking place on a soccer field during a game. The players are wearing jerseys for Barcelona, which is a professional soccer team based in Spain.

Abstract

VMVLM (Vision-Modulated Vision-Language Models) is a novel architecture that enhances multimodal instruction-following capabilities through dual-stream visual feature integration. By introducing complementary visual pathways that combine traditional query-based visual encoding with direct intermediate visual feature injection, VMVLM achieves superior performance in vision-language understanding tasks. This approach leverages a frozen Vision Transformer (ViT) encoder, a Q-Former for learned visual queries, and a direct vision projection module that feeds intermediate visual representations directly into the Large Language Model (LLM). Experimental results demonstrate significant improvements in visual question answering, image captioning, and multimodal instruction following across multiple benchmarks.

1. Introduction

Vision-Language Models (VLMs) have emerged as a critical component of multimodal AI systems, enabling machines to understand and respond to instructions that involve both visual and textual information. While existing query-based approaches such as BLIP-2 have shown promising results, they often suffer from an information bottleneck that limits the richness of the visual representations passed to the language model.

VMVLM addresses these limitations through a dual-stream approach to visual feature integration. Its key contribution is a Vision Modulation mechanism that provides the LLM with both compressed visual queries (via the Q-Former) and rich intermediate visual features (via direct projection), enabling a more nuanced understanding of visual content.

2. Architecture

2.1 Overall Design

VMVLM consists of three main components:

  1. Vision Encoder: A frozen EVA-CLIP Vision Transformer that extracts hierarchical visual features
  2. Q-Former Module: A BERT-based transformer that learns visual queries for cross-modal alignment
  3. Vision Modulation Layer: A novel projection module that directly injects intermediate visual features into the LLM

2.2 Model Flow

  graph TD
    A[Input Image] --> B["Vision Encoder<br/>EVA-CLIP ViT"]
    B --> C[Final Layer Features]
    B --> D[Intermediate Layer Features]
    C --> E["Q-Former<br/>Learned Query Tokens"]
    D --> F[Vision Projection]
    G[Text Input] --> E
    E --> H["Q-Former Output<br/>Input for LLM"]
    F --> I["Direct Visual Features<br/>Visual Features for LLM"]
    H --> J["Large Language Model<br/>Vicuna-7B / Flan-T5-XXL"]
    I --> J
    K[Text Embeddings] --> J
    J --> L[Generated Response]

2.3 Vision Modulation Mechanism

The core innovation of VMVLM lies in its dual-pathway visual processing, sketched in code after the pathway descriptions below:

Pathway 1: Query-Based Encoding
  • Utilizes learnable query tokens (32 tokens) processed through Q-Former
  • Provides compressed, task-relevant visual information
  • Maintains compatibility with existing BLIP-2 architectures
Pathway 2: Direct Feature Injection
  • Extracts intermediate features from the second-to-last ViT layer
  • Projects features directly into LLM embedding space
  • Preserves fine-grained visual details often lost in query-based approaches
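
The following is a minimal PyTorch sketch of how the two pathways could be wired together. All module and attribute names (vit, q_former, qformer_proj, vision_proj, query_tokens) are illustrative assumptions, not identifiers from a released implementation, and the ViT wrapper is assumed to return all per-layer hidden states.

import torch
import torch.nn as nn

class DualPathwayVisualEncoder(nn.Module):
    """Sketch of VMVLM's two visual pathways (names are illustrative)."""

    def __init__(self, vit, q_former, vit_dim=1408, qformer_dim=768,
                 llm_dim=4096, num_queries=32):
        super().__init__()
        self.vit = vit              # frozen EVA-CLIP ViT (assumed to expose hidden states)
        self.q_former = q_former    # BERT-based Q-Former with cross-attention
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, qformer_dim))
        self.qformer_proj = nn.Linear(qformer_dim, llm_dim)  # Q-Former output -> LLM space
        self.vision_proj = nn.Linear(vit_dim, llm_dim)       # intermediate ViT feats -> LLM space

    def forward(self, image):
        # Assumed to return a tuple/list of per-layer hidden states.
        hidden_states = self.vit(image, output_hidden_states=True)
        final_feats = hidden_states[-1]          # (B, 1 + N, vit_dim), consumed by the Q-Former
        mid_feats = hidden_states[-2][:, 1:, :]  # second-to-last layer, CLS token dropped

        # Pathway 1: compressed, task-relevant queries via Q-Former cross-attention.
        queries = self.query_tokens.expand(image.size(0), -1, -1)
        q_out = self.q_former(query_embeds=queries, encoder_hidden_states=final_feats)
        query_tokens_for_llm = self.qformer_proj(q_out)

        # Pathway 2: direct injection of intermediate spatial features.
        direct_tokens_for_llm = self.vision_proj(mid_feats)

        # Both streams are later concatenated with text embeddings as LLM input.
        return query_tokens_for_llm, direct_tokens_for_llm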

3. Implementation Details

3.1 Model Variants

VMVLM is implemented in two primary variants (an illustrative configuration sketch follows this list):

VMVLM-Vicuna:
  • Uses Vicuna-7B as the backbone LLM
  • Employs causal language modeling objective
  • Supports interactive dialogue and instruction following
VMVLM-FlanT5:
  • Uses Flan-T5-XXL as the backbone LLM
  • Employs encoder-decoder architecture
  • Optimized for structured question-answering tasks
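
As a rough illustration of the two configurations; the keys and string values below are illustrative placeholders, not official identifiers or checkpoint names:

# Illustrative variant configurations; values mirror Section 5.3.
VMVLM_VARIANTS = {
    "vmvlm-vicuna": {
        "llm_backbone": "vicuna-7b",   # decoder-only LLM
        "objective": "causal-lm",      # next-token prediction
        "strengths": ["dialogue", "instruction following"],
    },
    "vmvlm-flant5": {
        "llm_backbone": "flan-t5-xxl", # encoder-decoder LLM
        "objective": "seq2seq",        # encoder-decoder generation
        "strengths": ["structured question answering"],
    },
}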

3.2 Training Pipeline

  graph TD
    subgraph "Stage 1 Details"
        D[Image-Text Pairs] --> E["Vision Modulation<br/>Learning"]
        E --> F["Cross-modal<br/>Alignment"]
    end
    subgraph "Stage 2 Details"
        G["Instruction Datasets<br/>VQA, Captioning, etc."] --> H["Task-specific<br/>Adaptation"]
        H --> I["End-to-end<br/>Fine-tuning"]
    end

  graph LR
    A["Stage 1:<br/>Vision-Language<br/>Pretraining"] --> B["Stage 2:<br/>Instruction<br/>Finetuning"]
    B --> C["Optional:<br/>LoRA Adaptation"]

3.3 Key Technical Components

Vision Encoder Configuration:
  • EVA-CLIP ViT-g with 224x224 input resolution
  • Frozen weights during training for stability
  • FP16 precision for efficiency
Q-Former Setup:
  • BERT-base architecture with cross-attention layers
  • 32 learnable query tokens
  • Text input integration capability
Vision Projection:
  • Linear projection: Vision Features → LLM Embedding Dim
  • Applied to intermediate ViT features
  • Excludes CLS token, focuses on spatial features
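
As a rough sketch of how these components come together, the projected visual tokens from both pathways can be concatenated with the embedded text before entering the LLM. The helper name build_llm_inputs and the tensor shapes are assumptions for illustration only.

import torch

def build_llm_inputs(query_tokens, direct_tokens, text_embeds):
    """Concatenate both visual token streams with the text embeddings.

    query_tokens : (B, 32, llm_dim) -- Q-Former pathway
    direct_tokens: (B, N, llm_dim)  -- projected intermediate ViT features
    text_embeds  : (B, T, llm_dim)  -- embedded instruction text
    """
    inputs_embeds = torch.cat([query_tokens, direct_tokens, text_embeds], dim=1)

    # Visual tokens are always attended to, so their attention mask is all ones.
    # (Text padding is ignored here for brevity.)
    B = text_embeds.size(0)
    ones = lambda n: torch.ones(B, n, dtype=torch.long, device=text_embeds.device)
    attention_mask = torch.cat(
        [ones(query_tokens.size(1)), ones(direct_tokens.size(1)), ones(text_embeds.size(1))],
        dim=1,
    )
    return inputs_embeds, attention_mask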

4. Training Methodology

4.1 Loss Function Design

The model employs token-level loss masking to keep training focused on the target response and to prevent interference between input modalities. Labels are assigned according to each token's type and origin within the input sequence: the visual tokens from both pathways and the instruction text contribute no loss, so gradients flow only from the response tokens the model is asked to generate. This keeps the model from being penalized for visual or instructional content it does not produce, yielding more stable training and stronger instruction following. The masking rules are as follows, with a minimal sketch after the list:

  • No loss applied to visual tokens (both Q-Former and direct features)
  • No loss applied to instruction text (input)
  • Loss applied only to target response tokens
  • Padding tokens excluded from loss calculation
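
A minimal sketch of this masking, assuming the standard ignore index used by PyTorch cross-entropy and Hugging Face language-model losses; the helper name and tensor layout are illustrative:

import torch

IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(num_visual_tokens, instruction_ids, response_ids, pad_token_id):
    """Token-level loss mask: only response tokens carry loss."""
    visual_part = torch.full((num_visual_tokens,), IGNORE_INDEX, dtype=torch.long)
    instruction_part = torch.full_like(instruction_ids, IGNORE_INDEX)
    response_part = response_ids.clone()
    response_part[response_ids == pad_token_id] = IGNORE_INDEX  # exclude padding
    return torch.cat([visual_part, instruction_part, response_part], dim=0)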

4.2 Multi-stage Training

Stage 1: Vision-Language Pretraining

# Key components trained:
# - Q-Former parameters
# - LLM projection layers
# - Vision projection module

# Frozen: ViT encoder, LLM backbone

Stage 2: Instruction Finetuning

# Datasets: VQA, Image Captioning, Instruction Following
# Full model fine-tuning with task-specific prompts
# Supports both generative and discriminative tasks
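
A minimal sketch of the Stage 1 parameter freezing, reusing the illustrative attribute names from the Section 2.3 sketch (q_former, qformer_proj, vision_proj, query_tokens); these are assumptions, not names from the released code:

def configure_stage1(model):
    """Freeze the ViT and LLM backbone; train the Q-Former, query tokens, and projections."""
    for p in model.parameters():
        p.requires_grad = False                      # freeze everything by default
    for module in (model.q_former, model.qformer_proj, model.vision_proj):
        for p in module.parameters():
            p.requires_grad = True                   # unfreeze the trained components
    model.query_tokens.requires_grad = True          # learned visual queries
    return model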
              
            

4.3 Inference Pipeline

  sequenceDiagram
    participant U as User
    participant M as VMVLM
    participant V as Vision Encoder
    participant Q as Q-Former
    participant P as Vision Projection
    participant L as LLM
    U ->> M: Image + Question
    M ->> V: Extract visual features
    V ->> Q: Final layer features
    V ->> P: Intermediate layer features
    Q ->> L: Compressed visual queries
    P ->> L: Direct visual features
    M ->> L: Text embeddings
    L ->> M: Generated response
    M ->> U: Final answer
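
A short end-to-end inference sketch mirroring the sequence above, assuming a Hugging Face-style LLM wrapper (model.llm with a generate method), the dual-pathway encoder and build_llm_inputs helper sketched earlier, and a single image per call; all names are illustrative:

import torch

@torch.no_grad()
def answer(model, tokenizer, image, question, max_new_tokens=256):
    # Both visual pathways in one pass through the frozen vision encoder.
    query_tokens, direct_tokens = model.visual_encoder(image)
    # Embed the question with the LLM's own embedding table.
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = model.llm.get_input_embeddings()(text_ids)
    # Concatenate visual and text streams, then decode.
    inputs_embeds, attention_mask = build_llm_inputs(query_tokens, direct_tokens, text_embeds)
    output_ids = model.llm.generate(inputs_embeds=inputs_embeds,
                                    attention_mask=attention_mask,
                                    max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)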
    

5. Experimental Setup

5.1 Supported Tasks

  • Visual Question Answering: COCO-VQA, OK-VQA, A-OKVQA
  • Image Captioning: COCO Captions, NoCaps
  • Text-rich VQA: TextVQA, OCR-VQA, DocVQA
  • Instruction Following: LLaVA instruction dataset

5.2 Evaluation Metrics

  • Generation Tasks: BLEU, ROUGE, CIDEr scores
  • Classification Tasks: Accuracy, F1-score
  • Multiple Choice: Candidate ranking via likelihood
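
For multiple-choice evaluation, candidates can be ranked by the likelihood the model assigns them given the image and question. The sketch below targets the causal (Vicuna) variant and reuses the illustrative helpers above; it is not the released evaluation code:

import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_candidates(model, tokenizer, inputs_embeds, candidates):
    """Return the candidate with the highest length-normalized log-likelihood."""
    scores = []
    prefix_len = inputs_embeds.size(1)
    for cand in candidates:
        cand_ids = tokenizer(cand, return_tensors="pt", add_special_tokens=False).input_ids
        cand_embeds = model.llm.get_input_embeddings()(cand_ids)
        full_embeds = torch.cat([inputs_embeds, cand_embeds], dim=1)
        logits = model.llm(inputs_embeds=full_embeds).logits
        # Logits at position t predict token t+1, so shift by one.
        cand_logits = logits[:, prefix_len - 1 : -1, :]
        log_probs = F.log_softmax(cand_logits, dim=-1)
        token_lp = log_probs.gather(-1, cand_ids.unsqueeze(-1)).squeeze(-1)
        scores.append(token_lp.mean().item())  # average log-prob per candidate token
    return candidates[int(torch.tensor(scores).argmax())]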

5.3 Model Configuration

Component           VMVLM-Vicuna      VMVLM-FlanT5
Vision Encoder      EVA-CLIP ViT-g    EVA-CLIP ViT-g
LLM Backbone        Vicuna-7B         Flan-T5-XXL
Query Tokens        32                32
Max Text Length     128               128
Max Output Length   256               256

6. Key Innovations

6.1 Parallel Visual Pathways

Unlike single-pathway approaches, VMVLM maintains two complementary streams of visual information, enabling both high-level semantic understanding and fine-grained detail preservation.


6.2 Intermediate Feature Utilization

By using intermediate ViT layers rather than only final representations, the model captures multi-scale visual features that prove crucial for complex reasoning tasks.


6.3 Flexible Architecture Support

The vision modulation approach is architecture-agnostic and can be applied to both autoregressive (Vicuna) and encoder-decoder (FlanT5) language models.

7. Conclusion

7.1 Summary

VMVLM represents a significant advancement in compact and efficient vision-language modeling through its dual-stream visual processing architecture. By combining the benefits of learned visual queries with direct intermediate feature injection, the model achieves strong performance across diverse multimodal tasks while maintaining computational efficiency. The architecture's flexibility and empirical results demonstrate its potential as a foundation for future multimodal AI systems.


7.2 Future Work

  • Temporal Modeling: Extending the dual-pathway design to video understanding
  • Efficiency Optimization: Model compression and acceleration techniques
  • Multi-Resolution Processing: Adaptive visual feature extraction
  • Domain Adaptation: Specialized variants for medical, scientific, and other domain-specific imagery

BibTeX

@article{vmvlm,
  title={Vision-Modulated Vision-Language Models for Improved Instruction Following},
  author={Taneem Ullah Jan},
  year={2025}
}