<aside> 📌

Overview

📅 2024.11 - 2025.01

👥 2 Members (including myself)

📝 TL;DR

We propose a Prompt Learning-based framework for robust audio-visual classification under uncertain missing modality conditions.

Dynamically adapts to missing modality scenarios without requiring additional fine-tuning.

Leverages learnable prompts to compensate for missing features and enhance multi-modal fusion.

Outperforms full fine-tuning in efficiency, reducing memory usage by 82.3% while maintaining classification accuracy.

👨‍💻 My Role (50% Contribution)


🔍 Problem & Solution

Traditional audio-visual classification models assume both modalities are always available, but real-world scenarios often involve missing or degraded modalities due to noise, sensor failure, or transmission issues.

Existing methods degrade significantly when one modality is missing or unreliable.

Fine-tuning for every missing modality scenario is computationally expensive and impractical.

💡 To address these issues, we introduce a Prompt Learning-based framework that:

Adapts dynamically to missing modality scenarios without retraining.

Uses learnable prompts to compensate for missing modality information.

Maintains classification accuracy while significantly reducing computational cost.
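
A minimal sketch of this idea, assuming a PyTorch, token-based encoder pipeline (the module names, prompt counts, and dimensions below are illustrative, not the project's exact architecture):

```python
import torch
import torch.nn as nn

class PromptedFusion(nn.Module):
    """Learnable prompt tokens stand in for a missing modality before fusion."""

    def __init__(self, dim: int = 768, num_prompts: int = 4, num_classes: int = 10):
        super().__init__()
        # One set of learnable prompts per modality that may be missing
        self.audio_prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        self.visual_prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_tokens=None, visual_tokens=None):
        # At least one modality must be present
        ref = audio_tokens if audio_tokens is not None else visual_tokens
        batch = ref.size(0)
        # A missing modality is replaced by its learned prompt tokens
        if audio_tokens is None:
            audio_tokens = self.audio_prompts.expand(batch, -1, -1)
        if visual_tokens is None:
            visual_tokens = self.visual_prompts.expand(batch, -1, -1)
        fused = self.fusion(torch.cat([audio_tokens, visual_tokens], dim=1))
        return self.head(fused.mean(dim=1))  # sequence-pooled logits
```

Because only the prompt parameters (and the small fusion/classification head) are trained while the pretrained encoders stay frozen, the same backbone can handle complete and missing-modality inputs without per-scenario fine-tuning.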

</aside>

Background & Problem Definition

$\color{gray}\rule{361px}{1.5px}$

🔹 Limitations of Existing Approaches

📉 Performance Drop with Missing Modalities

As shown in the figure below, **CAV-MAE** exhibits a significant performance drop when one or both modalities are missing, making it unsuitable for real-world applications where modality loss is common.

*(Figure: CAV-MAE classification performance in the complete, audio-only, vision-only, and noisy settings)*

Audio-Only performance drops to 0.69, indicating that missing visual information reduces classification reliability.

Vision-Only achieves 0.83, but is still lower than in the complete setting.

❌ When both modalities are noisy, performance declines to 0.71, highlighting the model's instability.
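
For reference, a hypothetical helper illustrating how such missing- and noisy-modality evaluation settings can be simulated (the scenario names and the 0.1 noise level are assumptions, not the exact evaluation protocol used here):

```python
import torch

def build_eval_inputs(audio, visual, scenario: str, noise_std: float = 0.1):
    """Return (audio, visual) inputs for one evaluation scenario.
    A missing modality is represented as None."""
    if scenario == "audio-only":        # visual stream unavailable
        return audio, None
    if scenario == "vision-only":       # audio stream unavailable
        return None, visual
    if scenario == "both-noisy":        # both streams corrupted with Gaussian noise
        return (audio + noise_std * torch.randn_like(audio),
                visual + noise_std * torch.randn_like(visual))
    return audio, visual                # complete setting
```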

🔹 Research Problem

🔥 Key Contributions

$\color{gray}\rule{361px}{1.5px}$