no code implementations • 10 May 2024 • Dongwei Sun, Yajie Bao, Xiangyong Cao
Specifically, the SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network (CNN), a transformer encoder built on a sparse focus attention mechanism and designed to locate and capture changed regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences describing the differences.
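To make the encoder stage more concrete, here is a minimal sketch under stated assumptions. The abstract does not define the sparse focus attention mechanism, so the code swaps in a generic top-k sparse attention as a stand-in; it is not the authors' method. The names `change_features` and `sparse_topk_attention`, the toy feature vectors, and the use of a simple feature difference between the two dates are all hypothetical. The CNN extractor and the description decoder are not modeled.

```python
import math


def change_features(before, after):
    """Element-wise difference of per-location features from two dates.

    Assumption: stands in for CNN features of the dual-temporal images.
    """
    return [[b - a for a, b in zip(fa, fb)] for fa, fb in zip(before, after)]


def sparse_topk_attention(queries, keys, values, k):
    """Scaled dot-product attention restricted to the k best keys per query.

    Assumption: a generic top-k sparsification, used here only as a
    stand-in for the sparse focus attention named in the abstract.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim)
                  for key in keys]
        # Keep only the indices of the k highest scores.
        top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Numerically stable softmax over the kept scores only.
        peak = max(scores[i] for i in top)
        weights = {i: math.exp(scores[i] - peak) for i in top}
        total = sum(weights.values())
        # Weighted sum of the selected value vectors.
        out = [0.0] * len(values[0])
        for i, w in weights.items():
            for d, v in enumerate(values[i]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs


# Toy features for three locations; only the second location changes.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

diff = change_features(before, after)
attended = sparse_topk_attention(diff, diff, diff, k=1)
```

In this toy setup only the changed location yields a non-zero difference vector. Raising `k` lets each query mix in more locations, which trades sparsity for context.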
no code implementations • 11 Oct 2022 • Dongwei Sun, Zhuolin Gao
In recent years, methods guided by attention mechanisms have, compared with CNNs, overcome the lack of interaction between different channels and have captured and aggregated contextual information effectively.