<aside> 📌

Overview

📅 2024.07 - 2024.10

👥 3 Members (including myself)

📝 TL;DR

We propose a template-based multimodal prompt learning framework for few-shot visual grounding without fine-tuning.

Eliminates the need for dataset-specific retraining, improving adaptability to unseen classes.

Utilizes multimodal prompts with visual and textual templates to enhance generalization.

Incorporates pseudo-class embeddings and contrastive learning, achieving 83.6% accuracy on RefCOCOg, outperforming baselines.

👨‍💻 My Role (40% Contribution)


🔍 Problem & Solution

Traditional visual grounding models rely heavily on fine-tuning for each new category, making them inflexible for unseen objects. Existing few-shot learning approaches often struggle with generalization, resulting in poor performance on novel categories.

💡 To solve this, we introduce a Multi-modal Template-Based Learning framework that:

Leverages multimodal prompts (visual + textual templates) to provide richer contextual grounding.

Uses pseudo-class embeddings to create transferable representations across novel classes.

Incorporates contrastive learning to maximize inter-class separation and ensure robust intra-class consistency (a minimal sketch of these components follows this overview).

</aside>
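The three components above can be pictured with a short, self-contained PyTorch sketch. It is illustrative only: the module name `PseudoClassEmbedding`, the helper `build_multimodal_prompt`, the concatenation-based fusion, and the InfoNCE-style loss with a temperature of 0.07 are assumptions made for exposition, not the project's actual implementation.

```python
# Illustrative sketch (not the project's code): multimodal prompts built from
# visual/textual templates plus a learnable pseudo-class embedding, trained
# with a supervised InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoClassEmbedding(nn.Module):
    """Learnable embeddings that stand in for class-name tokens of unseen categories."""

    def __init__(self, num_pseudo_classes: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_pseudo_classes, dim)

    def forward(self, class_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(class_ids)  # (B, dim)


def build_multimodal_prompt(visual_template: torch.Tensor,
                            text_template: torch.Tensor,
                            pseudo_class: torch.Tensor) -> torch.Tensor:
    """Fuse visual-template, textual-template, and pseudo-class features into one
    prompt vector (plain concatenation here; the real fusion could be attention-based)."""
    fused = torch.cat([visual_template, text_template, pseudo_class], dim=-1)
    return F.normalize(fused, dim=-1)


def contrastive_loss(prompts: torch.Tensor, labels: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Supervised InfoNCE: same-label prompts are positives, all others negatives."""
    sim = prompts @ prompts.t() / temperature                      # (B, B) similarities
    eye = torch.eye(len(prompts), dtype=torch.bool, device=prompts.device)
    sim = sim.masked_fill(eye, -1e9)                               # drop self-pairs
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye       # positive-pair mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return (-(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()


if __name__ == "__main__":
    B, D = 8, 256
    pseudo = PseudoClassEmbedding(num_pseudo_classes=16, dim=D)
    class_ids = torch.randint(0, 4, (B,))
    prompts = build_multimodal_prompt(torch.randn(B, D),           # visual template features
                                      torch.randn(B, D),           # textual template features
                                      pseudo(class_ids))
    print("contrastive loss:", contrastive_loss(prompts, class_ids).item())
```

In this sketch, the pseudo-class embedding replaces the class-name token so that unseen categories can reuse the same template, and the supervised contrastive term pulls prompts of the same class together while pushing different classes apart.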

Background & Problem Definition

$\color{gray}\rule{361px}{1.5px}$

🔹 Limitations of Existing Approaches

📉 Challenges in Few-Shot Visual Grounding

As shown in the figures below, zero-shot prompting-based grounding models such as Grounding DINO, GLIP, and Florence-2 have been actively studied, but few-shot visual grounding models remain underdeveloped.

Figures: Grounding DINO, GLIP, and Florence-2 (zero-shot prompting-based grounding models)

Few-shot learning approaches still rely on fine-tuning, limiting adaptability to novel categories.

Zero-shot prompting methods exist, but their application to few-shot grounding remains underexplored.

🔹 Research Problem

🔥 Key Contributions

$\color{gray}\rule{361px}{1.5px}$