Efficient multi-step reasoning attention network for visual question answering

Haotian Zhang; Wei Wu; Meng Zhang

doi:10.1117/12.2623218

16 February 2022

Proceedings Volume 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); 120831S (2022) https://doi.org/10.1117/12.2623218
Event: Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 2021, Kunming, China

Abstract

The visual question answering (VQA) task requires using the content of the question to locate the corresponding regions of the image. Traditional attention-based VQA methods, however, cannot accurately match the image regions that are relevant to the question, which results in less satisfactory performance. In this paper, an Efficient Multi-step Reasoning Attention Network (EMRA), composed mainly of a multi-step reasoning attention module and G-LReLU non-linear layers, is proposed to address this problem. Specifically, the multi-step reasoning attention module combines the initial visual features, the question features, and the joint features to generate more effective attended features, which precisely represent the region information of the image that is related to the question. The attended visual features generated by the multi-step reasoning attention module and the question features are then fed into the G-LReLU non-linear layers, which apply a non-linear transformation so that the two can be fused better for answer prediction. In addition, we consider the relationship between model scaling and the number of reasoning steps: as the number of reasoning steps increases, increasing the model width improves the accuracy of our model. Experimental results on the VQA v2.0 dataset demonstrate that our model significantly outperforms methods based on Bottom-Up and Top-Down Attention and is competitive with state-of-the-art models.
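The abstract gives no equations, so the following is only an illustrative sketch of how such a multi-step attention loop might look, not the authors' implementation. Everything concrete in it is an assumption: "G-LReLU" is read here as a gated LeakyReLU (a LeakyReLU branch multiplied by a sigmoid gate), the joint feature is taken to be the element-wise product of the attended visual feature and the question feature, and the names `init_params`, `gated_leaky_relu`, and `multi_step_attention` are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def gated_leaky_relu(x, W, Wg):
    # ASSUMPTION: "G-LReLU" interpreted as LeakyReLU(xW) gated by sigmoid(xWg).
    return leaky_relu(x @ W) * sigmoid(x @ Wg)

def init_params(d, h, steps, seed=0):
    # Hypothetical random weights; one set per reasoning step.
    rng = np.random.default_rng(seed)
    return {
        "W":  [rng.normal(0.0, 0.1, (3 * d, h)) for _ in range(steps)],
        "Wg": [rng.normal(0.0, 0.1, (3 * d, h)) for _ in range(steps)],
        "w":  [rng.normal(0.0, 0.1, h) for _ in range(steps)],
    }

def multi_step_attention(V, q, params, steps):
    """V: (K, d) region features; q: (d,) question feature."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    K = V.shape[0]
    joint = q.copy()  # ASSUMPTION: the first step starts from the question alone
    for t in range(steps):
        # Combine visual, question, and joint features for every region.
        fused = np.concatenate(
            [V, np.tile(q, (K, 1)), np.tile(joint, (K, 1))], axis=1
        )  # shape (K, 3d)
        hidden = gated_leaky_relu(fused, params["W"][t], params["Wg"][t])
        weights = softmax(hidden @ params["w"][t])  # one weight per region
        attended = weights @ V                      # attended visual feature, (d,)
        joint = attended * q                        # ASSUMPTION: element-wise joint feature
    return attended, weights

# Usage with random stand-in features: 36 regions, 64-dimensional features.
rng = np.random.default_rng(1)
V_demo, q_demo = rng.normal(size=(36, 64)), rng.normal(size=64)
attended_demo, weights_demo = multi_step_attention(
    V_demo, q_demo, init_params(d=64, h=128, steps=3), steps=3
)
```

In this sketch the width `h` and the number of `steps` are independent knobs, which is one simple way to experiment with the width-versus-steps relationship the abstract mentions.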

Citation

Haotian Zhang, Wei Wu, and Meng Zhang "Efficient multi-step reasoning attention network for visual question answering", Proc. SPIE 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 120831S (16 February 2022); https://doi.org/10.1117/12.2623218


PROCEEDINGS
7 PAGES


KEYWORDS

Visualization

Data modeling

Performance modeling

Image fusion

Visual process modeling

Feature extraction

Unattended ground sensors
