DVANet: Disentangling View and Action Features for Multi-View Action Recognition

Center for Research in Computer Vision (CRCV), University of Central Florida

Abstract

In this work, we present a novel approach to multi-view action recognition in which learned action representations are guided to be separated from view-relevant information in a video. Classifying action instances captured from multiple viewpoints is more difficult due to differences in background, occlusion, and visibility of the action across camera angles. To tackle these problems, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoint. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to separately learn action and view information, which are then further disentangled using our two contrastive losses. We show that our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we achieve maximum improvements of 1.5%, 4.8%, 2.2%, and 4.8% on each dataset, respectively.

Video

Paper Details

Method Diagram

First, our spatio-temporal encoder extracts global features from a video. A transformer decoder then uses learnable action and view queries to extract action and view features separately. In addition to the classification losses used to predict the action and the view, we introduce two supervised contrastive losses to learn disentangled view and action representations.
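To make the decoding stage concrete, below is a minimal PyTorch sketch of a transformer decoder with separate learnable action and view queries attending to encoder features. Module names, dimensions, and hyperparameters (e.g., feat_dim, num_actions, num_views) are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch only: separate learnable queries produce action and view features
# from shared spatio-temporal video tokens; exact details are assumptions.
import torch
import torch.nn as nn

class DisentangledDecoder(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2,
                 num_actions=60, num_views=3):
        super().__init__()
        # Separate learnable queries for action and view information.
        self.action_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.view_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.view_head = nn.Linear(feat_dim, num_views)

    def forward(self, video_tokens):
        # video_tokens: (B, T, feat_dim) global features from the encoder.
        B = video_tokens.size(0)
        queries = torch.cat([self.action_query, self.view_query], dim=1)
        queries = queries.expand(B, -1, -1)
        decoded = self.decoder(tgt=queries, memory=video_tokens)
        action_feat, view_feat = decoded[:, 0], decoded[:, 1]
        return (self.action_head(action_feat), self.view_head(view_feat),
                action_feat, view_feat)
```

In training, the two classification heads would be supervised with cross-entropy on action and view labels, while the returned action and view features feed the contrastive objectives described below.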

Disentangled Representation Learning

Conceptual visualization of learned action and view features plotted on the x- and y-axes, respectively, from different learning objectives. a) Traditional feature learning for multi-view action recognition may lead to view features becoming entangled in the action embedding space, causing different action classes to erroneously cluster due to similar viewpoints. b) View-invariant action features will properly cluster by action class, but the disentangled view features do not cluster properly. c) Our method disentangles the view features from the action features while still retaining the structure of both embedding spaces, improving performance on unseen viewpoints.
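As a reference point for the contrastive objectives, the sketch below shows a generic supervised contrastive loss of the kind that can be applied to the action features using action labels (an analogous term can use view labels on the view features). This follows the standard SupCon formulation and is a hypothetical illustration, not the paper's exact loss code.

```python
# Generic supervised contrastive loss (SupCon-style); illustrative only.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    # features: (B, D) embeddings; labels: (B,) class ids (action or view).
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature        # (B, B) similarities
    B = features.size(0)
    # Exclude self-comparisons from both positives and the denominator.
    not_self = ~torch.eye(B, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    exp_sim = torch.exp(sim) * not_self
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    # Average log-probability of positives per anchor (skip anchors without positives).
    pos_counts = pos_mask.sum(dim=1)
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_counts.clamp(min=1)
    return -(mean_log_prob_pos[pos_counts > 0]).mean()
```

Pulling same-action samples together regardless of viewpoint, while a separate term clusters same-view samples, is one way to encourage the disentangled embedding structure illustrated in panel (c).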

Results

Results on four benchmark multi-view action recognition datasets. DVANet outperforms all uni-modal models, including both RGB- and skeleton-based multi-view action recognition works.

Below are qualitative results exhibiting the learned disentangled features of DVANet. Additional results are also provided to show how DVANet's learned embedding space aids in action recognition performance on unseen viewpoints when compared to previous works.

Conclusion

In this paper, we propose a novel transformer decoder-based architecture in tandem with two supervised contrastive losses for multi-view action recognition. By disentangling the view-relevant features from the action-relevant features, we enable our model to learn action features that are robust to changes in viewpoint. We show through various ablations, analyses, and visualizations that changes in viewpoint impart perturbations on learned action features, and that disentangling these perturbations improves overall action recognition performance. Uni-modal state-of-the-art performance is attained on four large-scale multi-view action recognition datasets, highlighting the efficacy of our method.

For more technical details and results, check out our attached main paper.

BibTeX

@article{siddiqui2023dvanet,
  title={DVANet: Disentangling View and Action Features for Multi-View Action Recognition},
  author={Siddiqui, Nyle and Tirupattur, Praveen and Shah, Mubarak},
  journal={AAAI},
  year={2024}
}