In this work, we present a novel approach to multi-view action recognition in which learned action representations are explicitly separated from view-relevant information in a video. Classifying action instances captured from multiple viewpoints is more difficult due to differences in background, occlusion, and visibility of the action across camera angles. To tackle these problems, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoint. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to learn action and view information independently, and these features are then further disentangled using our two contrastive losses. We show that our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB-based works, we achieve maximum improvements of 1.5%, 4.8%, 2.2%, and 4.8% on these datasets, respectively.
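Below is a minimal PyTorch sketch of the query-based decoding stage described above, assuming a generic video backbone that produces token features. All names and hyperparameters (e.g. DisentangledDecoder, n_action_queries, d_model) are illustrative placeholders, not the exact configuration used in the paper.

# Hypothetical sketch: separate learnable query sets decode action and view
# information from the same backbone features. Values are illustrative only.
import torch
import torch.nn as nn

class DisentangledDecoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 n_action_queries=4, n_view_queries=4, n_actions=60, n_views=3):
        super().__init__()
        # One query set attends to action cues, the other to view cues.
        self.action_queries = nn.Parameter(torch.randn(n_action_queries, d_model))
        self.view_queries = nn.Parameter(torch.randn(n_view_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)
        self.view_head = nn.Linear(d_model, n_views)

    def forward(self, video_tokens):
        # video_tokens: (B, T, d_model) features from a video backbone.
        B = video_tokens.size(0)
        queries = torch.cat([self.action_queries, self.view_queries], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(tgt=queries, memory=video_tokens)
        n_a = self.action_queries.size(0)
        action_feat = decoded[:, :n_a].mean(dim=1)  # pooled action representation
        view_feat = decoded[:, n_a:].mean(dim=1)    # pooled view representation
        return self.action_head(action_feat), self.view_head(view_feat), action_feat, view_feat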
Results on four benchmark multi-view action recognition datasets. DVANet outperforms all uni-modal models, including both RGB- and skeleton-based multi-view action recognition methods.
Below are qualitative results exhibiting the learned disentangled features of DVANet. Additional results show how DVANet's learned embedding space improves action recognition on unseen viewpoints compared to previous works.
In this paper, we propose a novel transformer decoder-based architecture in tandem with two supervised contrastive losses for multi-view action recognition. By disentangling the view-relevant features from action-relevant features, we enable our model to learn action features that are robust to changes in viewpoint. We show through various ablations, analyses, and visualizations that changes in viewpoint impart perturbations on learned action features; disentangling these perturbations therefore improves overall action recognition performance. Uni-modal state-of-the-art performance is attained on four large-scale multi-view action recognition datasets, highlighting the efficacy of our method.
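As an illustration of the second stage, the following is a minimal sketch of a supervised contrastive loss (in the style of SupCon) of the kind applied twice in training: once over action features with positives defined by shared action labels, and once over view features with positives defined by shared camera labels. The function name, temperature, and the loss weighting shown in the usage comment are assumptions, not the paper's exact values.

# Minimal supervised contrastive loss sketch; hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """features: (B, D) embeddings; labels: (B,) integer class ids."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature               # pairwise similarities
    mask_self = torch.eye(len(labels), device=features.device)
    mask_pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float() - mask_self
    logits = sim - 1e9 * mask_self                             # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask_pos.sum(dim=1).clamp(min=1)
    return -(mask_pos * log_prob).sum(dim=1).div(pos_count).mean()

# Usage sketch: pull together same-action clips (regardless of camera) and,
# separately, same-view clips (regardless of action), keeping the two feature
# sets disentangled. The weights lambda_a and lambda_v are assumed, not the
# paper's exact values.
# total = ce_action + lambda_a * supcon_loss(action_feat, action_labels) \
#                   + lambda_v * supcon_loss(view_feat, view_labels)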
For more technical details and results, check out our attached main paper.
@article{siddiqui2023dvanet,
  title={DVANet: Disentangling View and Action Features for Multi-View Action Recognition},
  author={Siddiqui, Nyle and Tirupattur, Praveen and Shah, Mubarak},
  journal={AAAI},
  year={2024}
}