1 code implementation • 29 May 2024 • Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captions.
Ranked #1 on Zero-Shot Video Question Answer on IntentQA
1 code implementation • 28 May 2024 • Jaehong Yoon, Shoubin Yu, Mohit Bansal
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities such as removal, addition, and modification, through a unified pipeline.
no code implementations • NeurIPS 2021 • Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, Chuang Gan
This paper introduces a new benchmark, Situated Reasoning in Real-World Videos (STAR Benchmark), that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos.
1 code implementation • 8 Feb 2024 • Shoubin Yu, Jaehong Yoon, Mohit Bansal
Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities.
Ranked #1 on Question Answering on SQA3D
1 code implementation • 28 Dec 2023 • Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
Ranked #1 on Zero-Shot Video Question Answer on NExT-GQA
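The summarize-then-answer prompting idea described in the entry above can be illustrated with a minimal sketch. This is not the authors' prompt: the function name `build_summarize_then_answer_prompt`, the wording of the instructions, and the option-labelling scheme are all hypothetical, chosen only to show the two-step structure (summarize the noisy captions first, then answer the question).

```python
import string


def build_summarize_then_answer_prompt(captions, question, options=None):
    """Assemble a two-step prompt: summarize clip captions, then answer.

    Hypothetical helper for illustration; the wording is an assumption,
    not the prompt used in the paper.
    """
    lines = [
        "Below are captions of short clips from a long video, in temporal order.",
        "The captions may be noisy or redundant.",
        "",
        "Captions:",
    ]
    # Number the captions so the model can refer to temporal order.
    for i, caption in enumerate(captions, start=1):
        lines.append(f"{i}. {caption.strip()}")

    lines += ["", f"Question: {question.strip()}"]

    # Optional multiple-choice options, labelled A, B, C, ...
    if options:
        for label, option in zip(string.ascii_uppercase, options):
            lines.append(f"{label}. {option.strip()}")

    # The summary step comes before the answer step.
    lines += [
        "",
        "First, summarize the captions in a few sentences.",
        "Then, using your summary, answer the question.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    demo = build_summarize_then_answer_prompt(
        ["a person opens a fridge ", "a person pours milk into a glass"],
        "What does the person do after opening the fridge?",
        ["pours milk", "closes the door"],
    )
    print(demo)
```

The resulting string would be sent to whichever LLM is in use; the model call itself is left out because it depends on the provider.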
1 code implementation • NeurIPS 2023 • Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
The SeViLA framework consists of two modules, Localizer and Answerer, both of which are parameter-efficiently fine-tuned from BLIP-2.
Ranked #3 on Video Question Answering on STAR Benchmark
1 code implementation • 7 Dec 2021 • Shoubin Yu, Zhongyin Zhao, Haoshu Fang, Andong Deng, Haisheng Su, Dongliang Wang, Weihao Gan, Cewu Lu, Wei Wu
Unlike pixel-based anomaly detection methods, pose-based methods use highly structured skeleton data, which reduces the computational burden and avoids the negative impact of background noise.
Anomaly Detection In Surveillance Videos • Optical Flow Estimation