Integrating 3D convolutional neural networks and transformer for video action recognition

Yan Cheng; Lingfeng Wan

doi:10.1117/12.3033760

13 June 2024

Proceedings Volume 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024); 131805D (2024) https://doi.org/10.1117/12.3033760
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 2024, Guangzhou, China

Abstract

Video action recognition requires extracting discriminative video features without sacrificing computational efficiency. To address this challenge, we propose 3D ResNet-Transformer, a video action recognition model that integrates a 3D ResNet (residual network) with the Transformer architecture. With 3D ResNet as its backbone, the model captures the spatial features of videos through a deep network structure. Transformer encoding layers then use self-attention to strengthen the spatio-temporal correlations among video features, improving recognition accuracy. The design thus combines the strengths of both components. Experiments on two standard video action recognition datasets, HMDB51 and UCF101, show accuracy improvements of 3.4% and 0.4% over baseline models, with Top-1 accuracies of 82.1% and 97.4%, respectively. These results validate the effectiveness of the integrated 3D ResNet and Transformer model in improving video recognition accuracy.
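The self-attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give layer sizes, head counts, or how backbone features are tokenized. The example assumes single-head scaled dot-product attention with identity projections (no learned weights), and the `clip_features` values are hypothetical stand-ins for feature vectors that a 3D CNN backbone might produce for consecutive video segments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections). A real Transformer encoder layer learns a
    separate linear projection for each of the three roles.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is the attention-weighted average of all tokens.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical per-segment features (illustrative values, not from the paper).
clip_features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
attended, attn = self_attention(clip_features)
```

In this toy input the first two segments have similar features, so the first segment attends more strongly to the second than to the third. That is the sense in which self-attention relates features across time steps.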

(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Citation

Yan Cheng and Lingfeng Wan "Integrating 3D convolutional neural networks and transformer for video action recognition", Proc. SPIE 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 131805D (13 June 2024); https://doi.org/10.1117/12.3033760

PROCEEDINGS
6 PAGES

KEYWORDS

3D modeling

Video

Video coding

Transformers

Video surveillance

Action recognition

Data modeling
