MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition

arXiv:2504.02279
Authors
  • Trung Thanh Nguyen
  • Yasutomo Kawanishi
  • Vijay John
  • Takahiro Komamizu
  • Ichiro Ide
Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. The proposed method leverages a Transformer-based architecture to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and to enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.
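To make the inter-view modeling idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of attention-based fusion across camera views: per-view feature vectors attend to one another via scaled dot-product attention, so each view's representation is re-weighted by its affinity to the other views before pooling into a single frame-level embedding. The function names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse_views(feats):
    """Fuse per-view features with scaled dot-product self-attention.

    feats: (V, D) array holding D-dim embeddings from V views for one frame.
    Returns a single (D,) fused embedding (mean-pooled attended features).
    """
    V, D = feats.shape
    scores = (feats @ feats.T) / np.sqrt(D)  # (V, V) inter-view affinities
    weights = softmax(scores, axis=-1)       # each view attends to all views
    attended = weights @ feats               # (V, D) context-aware view features
    return attended.mean(axis=0)             # pool views into one frame embedding

# Example: fuse four 16-dim view embeddings into one frame embedding.
rng = np.random.default_rng(0)
view_feats = rng.normal(size=(4, 16))
fused = attention_fuse_views(view_feats)
print(fused.shape)  # (16,)
```

A full Transformer block would add learned query/key/value projections, multiple heads, and a temporal encoder over frames; this sketch only shows the core cross-view attention step the abstract describes.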