Scaling 4D Representations

arXiv:2412.15212
Authors
  • João Carreira
  • Dilara Gokay
  • Michael King
  • Chuhan Zhang
  • Ignacio Rocco
  • Aravindh Mahendran
  • Thomas Albert Keck
  • Joseph Heyward
  • Skanda Koppula
  • Etienne Pot
  • Goker Erdogan
  • Yana Hasson
  • Yi Yang
  • Klaus Greff
  • Guillaume Le Moing
  • Sjoerd van Steenkiste
  • Daniel Zoran
  • Drew A. Hudson
  • Pedro Vélez
  • Luisa Polanía
  • Luke Friedman
  • Chris Duvarney
  • Ross Goroshin
  • Kelsey Allen
  • Jacob Walker
  • Rishabh Kabra
  • Eric Aboussouan
  • Jennifer Sun
  • Thomas Kipf
  • Carl Doersch
  • Viorica Pătrăucean
  • Dima Damen
  • Pauline Luc
  • Mehdi S. M. Sajjadi
  • Andrew Zisserman
Affiliations
  • Google DeepMind
  • University of Oxford

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks (action classification, ImageNet classification, etc.). In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks as model size increases from 20M all the way to 22B parameters, by far the largest self-supervised video model reported. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.