16 February 2022 Efficient multi-step reasoning attention network for visual question answering
Haotian Zhang, Wei Wu, Meng Zhang
Proceedings Volume 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); 120831S (2022) https://doi.org/10.1117/12.2623218
Event: Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 2021, Kunming, China
Abstract
Visual question answering (VQA) requires using the content of the question to locate the corresponding regions of the image. However, traditional attention-based VQA methods cannot accurately match the regions of the image that are relevant to the question, which results in less satisfactory performance. In this paper, an Efficient Multi-step Reasoning Attention Network (EMRA), composed mainly of a multi-step reasoning attention module and G-LReLU non-linear layers, is proposed to address this problem. Specifically, the multi-step reasoning attention module combines the initial visual features, the question features, and the joint features to generate more effective attended features, which precisely represent the region information of an image related to the question. Then, the attended visual features generated by the multi-step reasoning attention module and the question features are fed into the G-LReLU non-linear layers, which apply a non-linear transformation so that the two can be fused more effectively for answer prediction. In addition, considering the relationship between model scaling and the number of reasoning steps, as the number of reasoning steps increases, increasing the model width improves the accuracy of our model. Experimental results on the VQA v2.0 dataset demonstrate that our model significantly outperforms Bottom-Up and Top-Down Attention based methods and is competitive with state-of-the-art models.
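The abstract gives no equations, so the following is only a minimal sketch of how a multi-step attention loop over region features might look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the parameter shapes, the additive tanh scoring, the elementwise product used to form the joint feature, and the reading of "G-LReLU" as a leaky-ReLU branch scaled by a sigmoid gate.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def gated_leaky_relu(x, W, W_gate):
    # Assumed form of "G-LReLU" (the abstract does not define it):
    # a leaky-ReLU branch multiplied elementwise by a sigmoid gate.
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))
    return leaky_relu(x @ W) * gate


def multi_step_attention(V, q, params):
    """Hypothetical multi-step attention.

    V      : (K, d) initial visual features, one row per image region
    q      : (d,)   question feature
    params : list of (W_v, W_j, w) per step, with shapes (d, h), (d, h), (h,)
    Returns the final attended feature (d,) and attention weights (K,).
    """
    if not params:
        raise ValueError("at least one reasoning step is required")
    joint = q  # the first step is guided by the question alone
    for W_v, W_j, w in params:
        # Score every region against the current joint feature.
        hidden = np.tanh(V @ W_v + joint @ W_j)  # (K, h) via broadcasting
        weights = softmax(hidden @ w)            # (K,), sums to 1
        attended = weights @ V                   # (d,) weighted sum of regions
        # Assumed fusion: the joint feature for the next step combines the
        # attended visual feature with the question feature.
        joint = attended * q
    return attended, weights
```

Each step re-scores the original region features `V`, so later steps refine where the model looks instead of compounding earlier attention errors. The number of reasoning steps is simply `len(params)`, and the hidden size `h` stands in for the model width mentioned in the abstract.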
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Haotian Zhang, Wei Wu, and Meng Zhang "Efficient multi-step reasoning attention network for visual question answering", Proc. SPIE 12083, Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021), 120831S (16 February 2022); https://doi.org/10.1117/12.2623218
KEYWORDS
Visualization
Data modeling
Performance modeling
Image fusion
Visual process modeling
Feature extraction
Unattended ground sensors