AI Heap

Re-thinking Temporal Search for Long-Form Video Understanding

arXiv:2504.02259 [arXiv, PDF]
Authors
  • Jinhui Ye
  • Zihan Wang
  • Haosen Sun
  • Keshigeyan Chandrasegaran
  • Zane Durante
  • Cristobal Eyzaguirre
  • Yonatan Bisk
  • Juan Carlos Niebles
  • Ehsan Adeli
  • Li Fei-Fei
  • Jiajun Wu
  • Manling Li
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and study a fundamental issue affecting all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are two-fold.

First, we formulate temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos, given a specific query. To validate this formulation, we create LV-Haystack, the first benchmark of its kind, containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing both keyframe search quality and computational efficiency. Experimental results on LV-Haystack reveal a significant research gap in temporal search capabilities: SOTA keyframe selection methods achieve only a 2.1% temporal F1 score on the LVBench subset.

Second, inspired by visual search in images, we rethink temporal search and propose T*, a lightweight keyframe searching framework that casts the expensive temporal search as a spatial search problem. T* leverages the strong visual localization capabilities typically used on images and introduces an adaptive zooming-in mechanism that operates across both the temporal and spatial dimensions. Our extensive experiments show that, when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's from 56.5% to 62.4% on the LongVideoBench XL subset. Our PyTorch code, benchmark dataset, and models are included in the supplementary material.
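The abstract reports a temporal F1 score over selected keyframes but does not spell out the metric. A plausible sketch, assuming it is an F1 between predicted and annotated frame indices under a one-to-one matching rule with an optional tolerance window (the function name, `tolerance` parameter, and matching rule are illustrative assumptions, not the benchmark's exact definition):

```python
def temporal_f1(predicted, reference, tolerance=0):
    """Illustrative temporal F1 between predicted and annotated keyframes.

    A predicted frame matches at most one unused reference frame whose
    index lies within `tolerance` frames of it.
    """
    matched = set()
    tp = 0
    for p in predicted:
        for r in reference:
            if r not in matched and abs(p - r) <= tolerance:
                matched.add(r)  # each reference frame can be matched once
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With this definition, predicting only one of two annotated keyframes yields precision 1.0 and recall 0.5, hence F1 = 2/3; the very low scores reported on LVBench would correspond to predictions that rarely land on annotated frames.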
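The adaptive zooming-in mechanism is described only at a high level. A minimal coarse-to-fine sketch of the temporal side of such a search, assuming a hypothetical per-frame relevance scorer `score_fn` (the stride/zoom parameters and resampling rule are illustrative, not the authors' T* implementation):

```python
def zoom_in_search(num_frames, score_fn, budget=8, stride=256, zoom=4):
    """Coarse-to-fine keyframe search sketch (illustrative, not T* itself).

    score_fn(frame_idx) -> relevance score; in practice this would come
    from a visual localization model applied to sampled frames.
    """
    # Start from a sparse uniform sample over the whole video.
    candidates = set(range(0, num_frames, stride))
    while stride > 1:
        # Keep the highest-scoring candidates at the current resolution...
        top = sorted(candidates, key=score_fn, reverse=True)[:budget]
        stride = max(1, stride // zoom)
        # ...then zoom in: resample more densely around each survivor.
        candidates = set()
        for c in top:
            candidates.add(c)  # always retain the survivor itself
            lo = max(0, c - stride * zoom)
            hi = min(num_frames, c + stride * zoom + 1)
            candidates.update(range(lo, hi, stride))
    # Final pick under the frame budget, returned in temporal order.
    return sorted(sorted(candidates, key=score_fn, reverse=True)[:budget])
```

Each round scores far fewer frames than a dense pass over the whole video, which is how a search like this can stay within a small inference budget (e.g. 32 frames) while still localizing a handful of relevant frames among tens of thousands.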