no code implementations • 6 May 2024 • Jinwei Han, Yingguo Gao, Zhiwen Lin, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion and conducts spatial attention for visual to semantic embedding.
no code implementations • 9 Apr 2024 • Jinwei Han, Zhiwen Lin, Zhongyisun Sun, Yingguo Gao, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia
Specifically, two types of anchors are elaborated in our method, including i) text-compensated anchor which uses the images from the finetune set but enriches the text supervision from a pretrained captioner, ii) image-text-pair anchor which is retrieved from the dataset similar to pretraining data of CLIP according to the downstream task, associating with the original CLIP text with rich semantics.
no code implementations • 14 Dec 2023 • Yi Xin, Junlong Du, Qiang Wang, Zhiwen Lin, Ke Yan
Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), achieving a 3. 96%(1. 34%) relative improvement compared to single-task full fine-tuning, while utilizing merely ~1% (0. 36%) trainable parameters of the pre-trained model.
no code implementations • 20 Aug 2023 • Shuman Fang, Zhiwen Lin, Ke Yan, Jie Li, Xianming Lin, Rongrong Ji
However, these methods ignore the relationship among humans, objects, and interactions: 1) human features are more contributive than object ones to interaction prediction; 2) interactive information disturbs the detection of objects but helps human detection.
no code implementations • 2 Aug 2021 • Nenglun Chen, Xingjia Pan, Runnan Chen, Lei Yang, Zhiwen Lin, Yuqiang Ren, Haolei Yuan, Xiaowei Guo, Feiyue Huang, Wenping Wang
We study the problem of weakly supervised grounded image captioning.
1 code implementation • CVPR 2021 • Xingjia Pan, Yingguo Gao, Zhiwen Lin, Fan Tang, WeiMing Dong, Haolei Yuan, Feiyue Huang, Changsheng Xu
Weakly supervised object localization(WSOL) remains an open problem given the deficiency of finding object extent information using a classification network.