<aside> 📌
📅 2024.11 - 2025.01
👥 2 Members (including myself)
We propose a Prompt Learning-based framework for robust audio-visual classification under uncertain missing modality conditions.
✅ Dynamically adapts to missing modality scenarios without requiring additional fine-tuning.
✅ Leverages learnable prompts to compensate for missing features and enhance multi-modal fusion.
✅ Outperforms fine-tuning, reducing memory usage by 82.3% while maintaining classification accuracy.
Traditional audio-visual classification models assume both modalities are always available, but real-world scenarios often involve missing or degraded modalities due to noise, sensor failure, or transmission issues.
❌ Existing methods degrade significantly when one modality is missing or unreliable.
❌ Fine-tuning for every missing modality scenario is computationally expensive and impractical.
💡 To address these issues, we introduce a Prompt Learning-based framework that:
✅ Adapts dynamically to missing modality scenarios without retraining.
✅ Uses learnable prompts to compensate for missing modality information.
✅ Maintains classification accuracy while significantly reducing computational cost.
</aside>
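The core idea above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the project's actual code: the module names, prompt count, embedding dimension, and scenario keys are all assumptions. It shows the key mechanism, keeping the backbone frozen and training only a small bank of scenario-specific prompt tokens that are prepended to the token sequence in place of the missing modality.

```python
# Minimal sketch of learnable missing-modality prompts (PyTorch).
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class MissingModalityPrompts(nn.Module):
    """One bank of learnable prompt tokens per missing-modality scenario.

    The frozen backbone is untouched; only these prompts are trained,
    which is where the memory savings over full fine-tuning come from.
    """
    def __init__(self, num_prompts: int = 4, dim: int = 768):
        super().__init__()
        # One prompt bank per scenario: both modalities present,
        # audio only, or vision only.
        self.prompts = nn.ParameterDict({
            case: nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
            for case in ["complete", "audio_only", "vision_only"]
        })

    def forward(self, tokens: torch.Tensor, case: str) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) token sequence from the
        # available modality/modalities.
        batch = tokens.size(0)
        p = self.prompts[case].unsqueeze(0).expand(batch, -1, -1)
        # Prepend scenario-specific prompts; the backbone treats them as
        # extra tokens that compensate for the missing modality.
        return torch.cat([p, tokens], dim=1)

# Usage: at inference, select the prompt bank matching the observed inputs.
prompter = MissingModalityPrompts()
feats = torch.randn(2, 196, 768)              # e.g. visual tokens only
out = prompter(feats, case="vision_only")     # (2, 196 + 4, 768)
```

Because only the prompt parameters receive gradients, switching between missing-modality scenarios at test time is just a dictionary lookup, with no retraining or extra fine-tuned checkpoints per scenario.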
$\color{gray}\rule{361px}{1.5px}$
📉 Performance Drop with Missing Modalities
As shown in the figure below, **CAV-MAE** exhibits a significant performance drop when one or both modalities are missing, making it unsuitable for real-world applications where modality loss is common.
❌ Audio-Only performance drops to 0.69, indicating that missing visual information reduces classification reliability.
❌ Vision-Only achieves 0.83, but this is still lower than the complete-modality setting.
❌ When both modalities are noisy, performance further declines to 0.71, highlighting the model's instability.
$\color{gray}\rule{361px}{1.5px}$