no code implementations • 12 Apr 2024 • Yichen Yan, Xingjian He, Sihan Chen, Jing Liu
In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder.
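The iterative-calibration idea can be sketched in miniature: a query vector is refined over several decoder steps by attending to multi-modal feature vectors and folding the attended result back into the query. This is a hypothetical illustration of the general pattern, not the CRFormer implementation; all names and the residual update weight are assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(query, features, steps=3):
    """Iteratively refine `query` against `features` (a list of
    multi-modal feature vectors), re-computing attention with the
    updated query at every step -- the essence of iterative calibration."""
    for _ in range(steps):
        scores = softmax([sum(q * f for q, f in zip(query, feat))
                          for feat in features])
        # Attention-weighted sum of the features.
        attended = [sum(w * feat[d] for w, feat in zip(scores, features))
                    for d in range(len(query))]
        # Residual-style update (0.5/0.5 mix is an arbitrary choice here).
        query = [0.5 * q + 0.5 * a for q, a in zip(query, attended)]
    return query

refined = calibrate([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The key contrast with single-pass fusion is that the attention weights are recomputed after every update, so later steps attend with an already-calibrated query.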
1 code implementation • 17 Feb 2024 • Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu
Previous datasets and methods for the classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios.
no code implementations • 18 Aug 2023 • Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu
In previous approaches, fused vision-language features are fed directly into a decoder and passed through a convolution with a fixed kernel to obtain the result, following a pattern similar to traditional image segmentation.
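The fixed-kernel decoding pattern described above can be sketched in a few lines: fused features of shape (C, H, W) are reduced to a single segmentation logit map by a 1x1 convolution whose weights do not depend on the language expression. This is a pure-Python illustration under that assumption; function and variable names are hypothetical.

```python
def fixed_kernel_decode(fused, weight, bias=0.0):
    """Apply a 1x1 convolution with one output channel to fused features.

    fused:  list of C feature maps, each an H x W list of lists
    weight: list of C scalars -- the *fixed* kernel, shared across all
            spatial positions and all input expressions
    """
    C, H, W = len(fused), len(fused[0]), len(fused[0][0])
    return [
        [sum(weight[c] * fused[c][i][j] for c in range(C)) + bias
         for j in range(W)]
        for i in range(H)
    ]

# Toy example: 2 channels over a 2x2 spatial map.
fused = [
    [[1.0, 0.0], [0.0, 1.0]],   # channel 0
    [[0.0, 2.0], [2.0, 0.0]],   # channel 1
]
logits = fixed_kernel_decode(fused, weight=[1.0, 0.5])
# logits == [[1.0, 1.0], [1.0, 1.0]]
```

Because the kernel is the same for every expression, all instance-specific information must already be baked into the fused features before decoding, which is the limitation such papers set out to address.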
no code implementations • 24 May 2023 • Yichen Yan, Xingjian He, Wenxuan Wang, Jing Liu
However, this task is challenging due to the distinct data properties of text and images, and the randomness introduced by diverse objects and unrestricted language expressions.