- Published on
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
robustness-evaluationstate-of-the-art-performanceFortisAVQA-datasetAudio-Visual-Question-Answeringdebiasing-strategyMAVENdataset-biasesmultimodal-reasoning
Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China•Shannxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China•Information Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 510000, China•
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor...