HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Keywords: language-guided-feature-modulator, short-and-long-term-context, video-benchmarks, question-answering, multimodal-large-language-models, task-aware-hierarchical-Q-Former, video-understanding, frame-sampling, captioning-tasks
Center for Research in Computer Vision, University of Central Florida • Microsoft Research
Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks...