<aside> 📌

Overview

📅 2024.07 - 2024.10

👥 3 Members (including myself)

📝 TL;DR

We propose a template-based multimodal prompt learning framework for few-shot visual grounding without fine-tuning.

Eliminates the need for dataset-specific retraining, improving adaptability to unseen classes.

Utilizes multimodal prompts with visual and textual templates to enhance generalization.

Incorporates pseudo-class embeddings and contrastive learning, achieving 83.6% accuracy on RefCOCOg, outperforming baselines.

👨‍💻 My Role (40% Contribution)


🔍 Problem & Solution

Traditional visual grounding models rely heavily on fine-tuning for each new category, making them inflexible for unseen objects. Existing few-shot learning approaches often struggle with generalization, resulting in poor performance on novel categories.

💡 To solve this, we introduce a Multi-modal Template-Based Learning framework that:

Leverages multimodal prompts (visual + textual templates) to provide richer contextual grounding.

Uses pseudo-class embeddings to create transferable representations across novel classes.

Incorporates contrastive learning to maximize inter-class separation and ensure robust intra-class consistency (a minimal sketch of these components follows this overview).

</aside>
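The three components above can be pictured with a short, self-contained PyTorch sketch. It is illustrative only: the module name `PseudoClassEmbedding`, the helper `build_multimodal_prompt`, the concatenation-based fusion, and the InfoNCE-style loss with a temperature of 0.07 are assumptions made for exposition, not the project's actual implementation.

```python
# Illustrative sketch (not the project's code): multimodal prompts built from
# visual/textual templates plus a learnable pseudo-class embedding, trained
# with a supervised InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoClassEmbedding(nn.Module):
    """Learnable embeddings that stand in for class-name tokens of unseen categories."""

    def __init__(self, num_pseudo_classes: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_pseudo_classes, dim)

    def forward(self, class_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(class_ids)  # (B, dim)


def build_multimodal_prompt(visual_template: torch.Tensor,
                            text_template: torch.Tensor,
                            pseudo_class: torch.Tensor) -> torch.Tensor:
    """Fuse visual-template, textual-template, and pseudo-class features into one
    prompt vector (plain concatenation here; the real fusion could be attention-based)."""
    fused = torch.cat([visual_template, text_template, pseudo_class], dim=-1)
    return F.normalize(fused, dim=-1)


def contrastive_loss(prompts: torch.Tensor, labels: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Supervised InfoNCE: same-label prompts are positives, all others negatives."""
    sim = prompts @ prompts.t() / temperature                      # (B, B) similarities
    eye = torch.eye(len(prompts), dtype=torch.bool, device=prompts.device)
    sim = sim.masked_fill(eye, -1e9)                               # drop self-pairs
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye       # positive-pair mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return (-(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()


if __name__ == "__main__":
    B, D = 8, 256
    pseudo = PseudoClassEmbedding(num_pseudo_classes=16, dim=D)
    class_ids = torch.randint(0, 4, (B,))
    prompts = build_multimodal_prompt(torch.randn(B, D),           # visual template features
                                      torch.randn(B, D),           # textual template features
                                      pseudo(class_ids))
    print("contrastive loss:", contrastive_loss(prompts, class_ids).item())
```

In this sketch, the pseudo-class embedding replaces the class-name token so that unseen categories can reuse the same template, and the supervised contrastive term pulls prompts of the same class together while pushing different classes apart.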

Background & Problem Definition

$\color{gray}\rule{361px}{1.5px}$

🔹 Limitations of Existing Approaches

📉 Challenges in Few-Shot Visual Grounding

As shown in the figures below, zero-shot prompting-based grounding models such as Grounding DINO, GLIP, and Florence-2 have been actively studied, but few-shot visual grounding models remain underdeveloped.

Figures: Grounding DINO, GLIP, and Florence-2 (zero-shot prompting-based grounding models)

Few-shot learning approaches still rely on fine-tuning, limiting adaptability to novel categories.

Zero-shot prompting methods exist, but their application to few-shot grounding remains underexplored.

🔹 Research Problem

🔥 Key Contributions

$\color{gray}\rule{361px}{1.5px}$