Paper | 19 July 2024
Dual-encoder-based image-text fusion algorithm
Min Xia, Zhonghai Wu
Proceedings Volume 13213, International Conference on Image Processing and Artificial Intelligence (ICIPAI 2024); 132130X (2024) https://doi.org/10.1117/12.3035185
Event: International Conference on Image Processing and Artificial Intelligence (ICIPAI 2024), 2024, Suzhou, China
Abstract
Many sectors face the challenge of effectively representing knowledge in documents that contain multiple images closely tied to text, and of enabling models to understand the relationship between those images and the text. Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) acquire the ability to understand image-text relationships through large-scale pre-training. CLIP considers not only images and their related text but also contrasts images against large amounts of unrelated text, improving its ability to generalize the relationship between images and related text. BLIP strengthens its understanding of complex image-text relationships through pre-training and fine-tuning on matched image-text pairs. This paper presents an image-text fusion algorithm based on CLIP and BLIP, which gives an accurate and consistent picture of image-text relevance by fully exploiting CLIP's image-text generalization capacity and BLIP's capacity for understanding complex image-text relationships.
© (2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Min Xia and Zhonghai Wu "Dual-encoder-based image-text fusion algorithm", Proc. SPIE 13213, International Conference on Image Processing and Artificial Intelligence (ICIPAI 2024), 132130X (19 July 2024); https://doi.org/10.1117/12.3035185
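The abstract does not give implementation details, but the fusion it describes, combining CLIP's contrastive image-text similarity with BLIP's image-text matching (ITM) head, can be sketched as follows. This is a minimal illustration using the publicly available Hugging Face checkpoints openai/clip-vit-base-patch32 and Salesforce/blip-itm-base-coco; the fusion weight alpha and the score normalization are assumptions for illustration, not the authors' method.

```python
# Hedged sketch: fuse CLIP similarity with a BLIP image-text matching (ITM) score.
# Checkpoints, normalization, and the fusion weight `alpha` are illustrative
# assumptions; the paper does not specify its implementation.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    BlipForImageTextRetrieval, BlipProcessor,
)

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip_model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")

@torch.no_grad()
def fused_relevance(image: Image.Image, text: str, alpha: float = 0.5) -> float:
    """Weighted fusion of CLIP cosine similarity and BLIP ITM match probability."""
    # CLIP: cosine similarity between image and text embeddings, mapped to [0, 1].
    c_in = clip_proc(text=[text], images=image, return_tensors="pt", padding=True)
    c_out = clip_model(**c_in)
    img_emb = c_out.image_embeds / c_out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = c_out.text_embeds / c_out.text_embeds.norm(dim=-1, keepdim=True)
    clip_score = ((img_emb @ txt_emb.T).item() + 1.0) / 2.0

    # BLIP: probability that the pair matches, taken from the ITM head.
    b_in = blip_proc(images=image, text=text, return_tensors="pt")
    itm_logits = blip_model(**b_in).itm_score  # shape (1, 2): [no-match, match]
    blip_score = torch.softmax(itm_logits, dim=-1)[0, 1].item()

    # Convex combination of the two scores (alpha is an assumed hyperparameter).
    return alpha * clip_score + (1.0 - alpha) * blip_score
```

In use, such a fused score could rank candidate captions for an image, taking the highest-scoring text as the most relevant pairing.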
KEYWORDS
Image fusion
Education and training
Visual process modeling
Data modeling
Design
Image processing
Image segmentation