no code implementations • 17 May 2024 • Zikun Zhou, Wentao Xiong, Li Zhou, Xin Li, Zhenyu He, YaoWei Wang
As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch.
Referring Expression Segmentation • Referring Video Object Segmentation • +3