Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition.

Abstract

Action recognition is a significant and challenging topic in sensing and computer vision. Two-stream convolutional neural networks (CNNs) and 3D CNNs are the two mainstream deep learning architectures for video action recognition. To combine them in one framework and further improve performance, we propose a novel deep network, the spatiotemporal interaction residual network with pseudo3D (STINP). The STINP has three advantages. First, it consists of two branches built on residual networks (ResNets) that learn the spatial and temporal information of the video simultaneously. Second, it integrates the pseudo3D block into the residual units of the spatial branch, so that this branch learns not only the appearance features of the objects and scene in the video but also the potential interaction information among consecutive frames. Finally, it fuses the spatial and temporal branches with a simple but effective multiplication operation, so that the learned spatial and temporal representations interact with each other throughout training. Experiments were conducted on two classic action recognition datasets, UCF101 and HMDB51. The results show that the proposed STINP achieves better video action recognition performance than other state-of-the-art algorithms.
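
As a rough illustration of two ideas named in the abstract, the sketch below counts the weights in a pseudo3D-style factorized convolution and applies an elementwise product to two feature vectors. It is not the authors' implementation. It assumes the serial pseudo3D arrangement (a 1 x 3 x 3 spatial convolution followed by a 3 x 1 x 1 temporal convolution), and the abstract does not say which arrangement the STINP uses or at which layers the multiplication is applied. The function names `p3d_weight_count`, `full3d_weight_count`, and `multiplicative_fusion` are hypothetical.

```python
def full3d_weight_count(c_in, c_out, k_spatial=3, k_temporal=3):
    """Weights (biases ignored) in one full k_t x k_s x k_s 3D convolution."""
    return c_in * c_out * k_temporal * k_spatial * k_spatial


def p3d_weight_count(c_in, c_out, k_spatial=3, k_temporal=3):
    """Weights (biases ignored) in a factorized pseudo3D-style pair.

    Assumption: a 1 x k_s x k_s spatial convolution (c_in -> c_out)
    followed in series by a k_t x 1 x 1 temporal convolution (c_out -> c_out).
    """
    spatial = c_in * c_out * k_spatial * k_spatial
    temporal = c_out * c_out * k_temporal
    return spatial + temporal


def multiplicative_fusion(spatial_feat, temporal_feat):
    """Elementwise product of two equally sized flat feature vectors.

    Assumption: this only shows the multiplication itself, not where
    the STINP places it inside the residual units.
    """
    if len(spatial_feat) != len(temporal_feat):
        raise ValueError("feature vectors must have the same length")
    return [s * t for s, t in zip(spatial_feat, temporal_feat)]


if __name__ == "__main__":
    print(full3d_weight_count(64, 64))   # full 3 x 3 x 3 convolution
    print(p3d_weight_count(64, 64))      # factorized spatial + temporal pair
    print(multiplicative_fusion([1.0, 2.0, 3.0], [0.5, 0.0, 2.0]))
```

With 64 input and 64 output channels, the factorized pair uses 12 weights per channel pair instead of 27, which is the usual motivation for pseudo3D blocks. The product shows how a temporal response of zero suppresses the matching spatial response, one way the two branches can influence each other.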


