no code implementations • 18 May 2024 • Junzhang Liu, Zhecan Wang, Hammad Ayyubi, Haoxuan You, Chris Thomas, Rui Sun, Shih-Fu Chang, Kai-Wei Chang
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
no code implementations • 27 Mar 2024 • Ali Zare, Yulei Niu, Hammad Ayyubi, Shih-Fu Chang
(3) Annotation cost: annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, limiting generalizability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is not fixed or pre-determined.
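To make the "adaptive" part of the setting concrete, here is a minimal sketch of a planner that decodes action steps until it emits an end-of-procedure token, so the plan length is decided at inference time rather than fixed in advance. All names, the conditioning scheme, and the greedy decoding loop are illustrative assumptions, not the paper's released model.

```python
import torch
import torch.nn as nn

class AdaptivePlanner(nn.Module):
    """Hypothetical sketch: decode a variable-length procedure from
    start/goal observations, stopping at an end-of-procedure token."""

    def __init__(self, num_actions, dim=256, max_steps=12):
        super().__init__()
        self.eop = num_actions  # extra vocabulary index = end-of-procedure
        self.embed = nn.Embedding(num_actions + 1, dim)
        self.rnn = nn.GRUCell(dim, dim)
        self.head = nn.Linear(dim, num_actions + 1)
        self.max_steps = max_steps

    def forward(self, start_feat, goal_feat):
        # Condition the decoder state on start and goal observations,
        # both assumed to be (1, dim) visual features.
        h = start_feat + goal_feat
        prev = torch.zeros_like(h)  # dummy input for the first step
        steps = []
        for _ in range(self.max_steps):
            h = self.rnn(prev, h)
            action = self.head(h).argmax(-1)    # greedy choice, (1,)
            if action.item() == self.eop:       # stop: length is adaptive
                break
            steps.append(action.item())
            prev = self.embed(action)
        return steps
```

Under this sketch, two different start/goal pairs can yield plans of different lengths without any length label at training or test time, which is the point of the adaptive setting.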
1 code implementation • 22 Oct 2022 • Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang
Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences in the video, and these sentences may appear at different semantic scales.
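A minimal sketch of the WSAG input/output contract may help: score each article sentence against per-clip video features and keep sentences whose best-matching span clears a threshold, leaving the rest unlocalized. The embedding model, the threshold, and the span-growing heuristic are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def localize_sentences(sent_embs, clip_embs, threshold=0.5):
    """Hypothetical WSAG-style localization.

    sent_embs: (S, d) sentence embeddings, L2-normalized.
    clip_embs: (T, d) per-clip video features, L2-normalized.
    Returns, per sentence, None (deemed ungroundable) or the
    (start, end) clip indices of its best-matching contiguous span.
    """
    sims = sent_embs @ clip_embs.T  # (S, T) cosine similarities
    results = []
    for row in sims:
        t = int(row.argmax())
        if row[t] < threshold:
            results.append(None)  # no clip matches: not groundable
            continue
        # Grow the span around the peak while similarity stays high,
        # so coarse sentences can ground to longer video spans.
        lo = hi = t
        while lo > 0 and row[lo - 1] >= threshold:
            lo -= 1
        while hi < len(row) - 1 and row[hi + 1] >= threshold:
            hi += 1
        results.append((lo, hi))
    return results
```

Returning spans of varying width is one simple way to reflect the abstract's point that groundable sentences sit at different semantic scales: a specific sentence grounds to one clip, a summary-level one to a longer stretch.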