AI Heap
OmniCam: Unified Multimodal Video Generation via Camera Control

arXiv:2504.02312 · [arXiv] [PDF]
Authors
  • Xiaoda Yang
  • Jiayang Xu
  • Kaixuan Luan
  • Xinyu Zhan
  • Hongshun Qiu
  • Shijun Shi
  • Hao Li
  • Shuai Yang
  • Li Zhang
  • Checheng Yu
  • Cewu Lu
  • Lixin Yang
Affiliations
  • Zhejiang University
  • Shanghai Jiao Tong University
  • Beijing University of Technology
  • Jiangnan University
  • University of Science and Technology of China
  • Nanjing University
Abstract
Camera control, which achieves diverse visual effects by changing a camera's position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction requirements and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera-control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or a video with the expected trajectory as camera-path guidance, and an image or a video as the content reference, enabling precise control over camera motion. To facilitate training, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.
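The abstract describes a two-slot input interface: one camera-path guidance source (text or a trajectory video) paired with one content reference (image or video). A minimal sketch of how such modality combinations might be represented and validated is below; all names (`OmniCamRequest`, field names) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OmniCamRequest:
    """Hypothetical request object for an OmniCam-style pipeline.

    Camera-path guidance: exactly one of text_trajectory / video_trajectory.
    Content reference:    exactly one of image_content / video_content.
    """
    text_trajectory: Optional[str] = None   # e.g. "orbit left, then zoom in"
    video_trajectory: Optional[str] = None  # path to a video showing the desired motion
    image_content: Optional[str] = None     # path to a still image
    video_content: Optional[str] = None     # path to a reference video

    def validate(self) -> None:
        # Exactly one guidance source and exactly one content source.
        guidance = [self.text_trajectory, self.video_trajectory]
        content = [self.image_content, self.video_content]
        if sum(x is not None for x in guidance) != 1:
            raise ValueError("provide exactly one of: text or video trajectory guidance")
        if sum(x is not None for x in content) != 1:
            raise ValueError("provide exactly one of: image or video content reference")

# Example: text-guided camera path applied to a single input image.
req = OmniCamRequest(text_trajectory="pan right, then tilt up",
                     image_content="scene.jpg")
req.validate()  # passes: one guidance source, one content source
```

This yields the four input combinations the abstract enumerates (text+image, text+video, trajectory-video+image, trajectory-video+video) while rejecting ambiguous or incomplete requests.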