<aside> 📌
📅 2024.07 - 2024.10
👥 3 Members (including myself)
We propose a template-based multimodal prompt learning framework for few-shot visual grounding without fine-tuning.
✅ Eliminates the need for dataset-specific retraining, improving adaptability to unseen classes.
✅ Utilizes multimodal prompts with visual and textual templates to enhance generalization.
✅ Incorporates pseudo-class embeddings and contrastive learning, achieving 83.6% accuracy on RefCOCOg, outperforming baselines.
Traditional visual grounding models rely heavily on fine-tuning for each new category, making them inflexible for unseen objects. Existing few-shot learning approaches often struggle with generalization, resulting in poor performance on novel categories.
💡 To solve this, we introduce a template-based multimodal prompt learning framework that:
✅ Leverages multimodal prompts (visual + textual templates) to provide richer contextual grounding.
✅ Uses pseudo-class embeddings to create transferable representations across novel classes.
✅ Incorporates contrastive learning to maximize inter-class separation and ensure robust intra-class consistency (see the sketch after this callout).
</aside>
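The callout above names three components: multimodal prompts built from visual and textual templates, pseudo-class embeddings, and a contrastive objective. The sketch below shows one way these pieces could fit together in PyTorch. It is a minimal illustration, not the project's actual code: `MultimodalPromptBuilder`, `pseudo_class_contrastive_loss`, the prompt token ordering, and all tensor shapes are assumed for the example.

```python
# Minimal sketch (illustrative, not the project's implementation) of:
# multimodal prompts from visual + textual templates, learnable pseudo-class
# embeddings, and an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalPromptBuilder(nn.Module):
    """Builds a prompt sequence from a learnable pseudo-class token,
    visual-template features, and textual-template token embeddings."""

    def __init__(self, num_pseudo_classes: int, dim: int):
        super().__init__()
        # One learnable embedding per pseudo-class (a transferable "class slot").
        self.pseudo_class_emb = nn.Parameter(torch.randn(num_pseudo_classes, dim) * 0.02)
        # Projects pooled visual-template (support crop) features into the prompt space.
        self.visual_proj = nn.Linear(dim, dim)

    def forward(self,
                text_template_emb: torch.Tensor,    # (B, L, D) tokens of e.g. "a photo of a {}"
                visual_template_feat: torch.Tensor,  # (B, K, D) pooled features of K support crops
                class_idx: torch.Tensor              # (B,) pseudo-class index per sample
                ) -> torch.Tensor:
        pseudo = self.pseudo_class_emb[class_idx].unsqueeze(1)   # (B, 1, D)
        visual = self.visual_proj(visual_template_feat)          # (B, K, D)
        # [pseudo-class token | visual template tokens | text template tokens]
        return torch.cat([pseudo, visual, text_template_emb], dim=1)


def pseudo_class_contrastive_loss(region_feats: torch.Tensor,  # (B, D) pooled region features
                                  class_emb: torch.Tensor,     # (C, D) pseudo-class embeddings
                                  labels: torch.Tensor,        # (B,) class index per region
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style objective: pull each region feature toward its own
    pseudo-class embedding and push it away from the other classes."""
    q = F.normalize(region_feats, dim=-1)
    c = F.normalize(class_emb, dim=-1)
    logits = q @ c.t() / temperature          # (B, C) scaled cosine similarities
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, L, K, D, C = 4, 8, 3, 256, 10
    builder = MultimodalPromptBuilder(num_pseudo_classes=C, dim=D)
    prompts = builder(torch.randn(B, L, D), torch.randn(B, K, D), torch.randint(0, C, (B,)))
    loss = pseudo_class_contrastive_loss(torch.randn(B, D), builder.pseudo_class_emb,
                                         torch.randint(0, C, (B,)))
    print(prompts.shape, float(loss))
```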
$\color{gray}\rule{361px}{1.5px}$
📉 Challenges in Few-Shot Visual Grounding
As shown in the figure below, zero-shot prompting-based grounding models (Grounding DINO, GLIP, Florence-2) are an active area of research, but few-shot visual grounding models remain underdeveloped.
(Figure: example zero-shot grounding outputs from Grounding DINO, GLIP, and Florence-2)
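For contrast, zero-shot prompting with one of the models above is already straightforward. The snippet below is a minimal sketch using the Hugging Face `transformers` integration of Grounding DINO; the checkpoint name, image URL, and threshold arguments follow the library's published example and may differ across library versions.

```python
# Minimal zero-shot prompting sketch with Grounding DINO via Hugging Face
# transformers (argument names such as box_threshold may vary by version).
import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
# Text prompt: categories are given as lower-case phrases separated by periods.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```

Few-shot grounding has no equally simple recipe, which motivates the gaps listed below.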
❌ Few-shot learning approaches still rely on fine-tuning, limiting adaptability to novel categories.
❌ Zero-shot prompting methods exist, but their application to few-shot grounding remains underexplored.
$\color{gray}\rule{361px}{1.5px}$