1 code implementation • ECCV 2020 • Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
Recent studies have shown that aggregating contextual information from proposals in different frames can clearly enhance the performance of video object detection.
Ranked #12 on Video Object Detection on ImageNet VID
no code implementations • 24 Apr 2024 • Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation.
1 code implementation • 24 Mar 2024 • Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, creating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Zero-Shot Video Question Answer on MVBench
3 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao
Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain.
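To make the adaptation concrete, below is a toy sketch (not the paper's code) of flattening a video clip into one token sequence and scanning it sequentially; a plain linear recurrence stands in for Mamba's input-dependent selective scan, purely to illustrate the data flow.

```python
# Illustrative only: 3D patch embedding flattens a clip into one token sequence,
# which is then processed by a simple linear state-space style scan.
import torch
import torch.nn as nn

class TinyVideoScan(nn.Module):
    def __init__(self, dim=64, patch=(2, 16, 16)):
        super().__init__()
        # (B, 3, T, H, W) -> (B, dim, T', H', W')
        self.embed = nn.Conv3d(3, dim, kernel_size=patch, stride=patch)
        self.A = nn.Linear(dim, dim, bias=False)  # state transition
        self.B = nn.Linear(dim, dim, bias=False)  # input projection

    def forward(self, video):
        x = self.embed(video).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        h = torch.zeros(x.shape[0], x.shape[2])
        states = []
        for t in range(x.shape[1]):                # sequential scan over tokens
            h = torch.tanh(self.A(h) + self.B(x[:, t]))
            states.append(h)
        return torch.stack(states, dim=1)          # (B, N, dim)

clip = torch.randn(2, 3, 8, 64, 64)  # two 8-frame 64x64 clips
print(TinyVideoScan()(clip).shape)   # torch.Size([2, 64, 64])
```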
no code implementations • 29 Feb 2024 • Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
Finally, we blend external multimodal knowledge in the Adapt stage by inserting multimodal knowledge adaptation modules into the network.
no code implementations • 26 Jan 2024 • Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents.
1 code implementation • 17 Jan 2024 • Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, Yali Wang
More importantly, Vlogger can generate vlogs of over five minutes from open-world descriptions, without losing video coherence in script and actor.
1 code implementation • 19 Dec 2023 • Siran Chen, Yue Ma, Yu Qiao, Yali Wang
It mimics various missing cases by randomly masking features of different camera views, then leverages the original features of these views as self-supervision, and reconstructs the masked ones with the distinct spatio-temporal context across views.
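A minimal sketch of this masked-view self-supervision, with all module choices (a generic transformer mixer, MSE reconstruction) being assumptions rather than the authors' implementation:

```python
# Features of randomly chosen camera views are masked out; a cross-view mixer must
# reconstruct them from the remaining views, with the originals as the target.
import torch
import torch.nn as nn

views, dim = 4, 128
feats = torch.randn(8, views, dim)              # (batch, camera views, feature dim)

mask = torch.rand(8, views, 1) < 0.5            # randomly "drop" roughly half the views
masked = feats.masked_fill(mask, 0.0)

# cross-view context mixer: each view attends to all (unmasked) views
mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
recon = mixer(masked)

# self-supervised loss: reconstruct the masked views from the visible ones
loss = ((recon - feats) ** 2)[mask.expand_as(feats)].mean()
loss.backward()
```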
no code implementations • 8 Dec 2023 • Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, Yu Qiao
While several long-form VideoQA datasets have been introduced, the lengths of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions have not yet reached the criteria for genuine long-form video understanding.
2 code implementations • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Ranked #1 on Video Question Answering on IntentQA
no code implementations • 31 Oct 2023 • Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu
The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos.
1 code implementation • 30 Oct 2023 • Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.
1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao
Specifically, we utilize a multi-scale approach to generate video-related descriptions.
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
1 code implementation • 10 May 2023 • Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, Yu Qiao
In this paper, we make an initial attempt to develop an end-to-end chat-centric video understanding system, coined VideoChat.
Ranked #1 on Question Answering on NExT-QA (Open-ended VideoQA)
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Unlike existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, significantly improving both the efficiency of communication between users and chatbots and the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where more than two objects are present.
1 code implementation • CVPR 2023 • Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao
Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).
Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)
1 code implementation • ICCV 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao
Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.
Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)
2 code implementations • 14 Mar 2023 • Renrui Zhang, Liuhui Wang, Ziyu Guo, Yali Wang, Peng Gao, Hongsheng Li, Jianbo Shi
We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions.
Ranked #1 on Training-free 3D Part Segmentation on ShapeNet-Part
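The pipeline lends itself to a compact illustration. The sketch below implements farthest point sampling, k-NN grouping, a trigonometric embedding, and max pooling with no learnable weights; the feature dimension and neighborhood sizes are illustrative, not the paper's exact configuration.

```python
import torch

def fps(xyz, m):
    # iteratively pick the point farthest from the already-chosen set
    n = xyz.shape[0]
    idx = [0]
    dist = torch.full((n,), float("inf"))
    for _ in range(m - 1):
        dist = torch.minimum(dist, (xyz - xyz[idx[-1]]).pow(2).sum(-1))
        idx.append(int(dist.argmax()))
    return torch.tensor(idx)

def trig_embed(xyz, dim=12):
    # sinusoidal encoding of raw coordinates, in place of learned features
    freqs = 2.0 ** torch.arange(dim // 6)                 # a few octaves per axis
    ang = xyz.unsqueeze(-1) * freqs                       # (n, 3, dim // 6)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)  # (n, dim)

pts = torch.rand(1024, 3)
centers = pts[fps(pts, 64)]                               # 64 centroids
knn = (centers.unsqueeze(1) - pts.unsqueeze(0)).pow(2).sum(-1).topk(16, largest=False).indices
feat = trig_embed(pts)[knn]                               # (64, 16, dim) grouped features
pooled = feat.max(dim=1).values                           # permutation-invariant pooling
print(pooled.shape)                                       # torch.Size([64, 12])
```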
no code implementations • ICCV 2023 • Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, Yu Qiao
To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework, which can effectively align lingual and visual features to discover core object semantics in the video, by learning multimodal interaction hierarchically from different temporal scales.
Ranked #6 on Referring Video Object Segmentation on Refer-YouTube-VOS (using extra training data)
no code implementations • ICCV 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted research into adapting the image ViTs for video tasks.
no code implementations • CVPR 2023 • Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han, Yu Qiao
Moreover, such scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring the model to learn consistent representations from unmasked areas.
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)
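As a rough illustration of coordinating the two objectives, the sketch below blends a masked-reconstruction loss with a temperature-scaled video-text InfoNCE loss; every module here is a placeholder, and the actual InternVideo coordination scheme is more involved.

```python
import torch
import torch.nn.functional as F

B, N, D = 4, 32, 256
video_tokens = torch.randn(B, N, D)
text_emb = F.normalize(torch.randn(B, D), dim=-1)

# masked video modeling: reconstruct randomly hidden tokens
mask = torch.rand(B, N) < 0.75
pred = torch.randn(B, N, D, requires_grad=True)     # stands in for a decoder output
mvm_loss = F.mse_loss(pred[mask], video_tokens[mask])

# video-text contrastive: paired clips and captions should match
video_emb = F.normalize(video_tokens.mean(dim=1), dim=-1)
logits = video_emb @ text_emb.t() / 0.07            # temperature-scaled similarities
vtc_loss = F.cross_entropy(logits, torch.arange(B))

alpha = torch.sigmoid(torch.tensor(0.0))            # learnable blend weight in practice
loss = alpha * mvm_loss + (1 - alpha) * vtc_loss
```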
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei Huang, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, Limin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
3 code implementations • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao
UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format.
no code implementations • 20 Oct 2022 • Yi Liu, Xuan Zhang, Ying Li, Guixin Liang, Yabing Jiang, Lixia Qiu, Haiping Tang, Fei Xie, Wei Yao, Yi Dai, Yu Qiao, Yali Wang
For this reason, we propose to advance research areas of video understanding, with a shift from traditional action recognition to industrial anomaly analysis.
no code implementations • 28 Sep 2022 • Boyu Chen, Yu Qiao, Yali Wang
Second, these activities are naturally distributed in a long-tailed way.
no code implementations • TIP 2022 • Peiqin Zhuang, Yu Guo, Zhipeng Yu, Luping Zhou, Lei Bai, Ding Liang, Zhiyong Wang, Yali Wang, Wanli Ouyang
To address this issue, we introduce a Motion Diversification and Selection (MoDS) module to generate diversified spatio-temporal motion features and then select the suitable motion representation dynamically for categorizing the input video.
Ranked #18 on Action Recognition on Something-Something V1
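A bare-bones sketch of the diversify-then-select idea follows; the branch designs and the softmax gate are assumptions in the spirit of MoDS, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiversifySelect(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0)),   # short-range temporal conv
            nn.Conv3d(c, c, (5, 1, 1), padding=(2, 0, 0)),   # longer-range temporal conv
            nn.Conv3d(c, c, 1),                              # pointwise (no motion) branch
        ])
        self.gate = nn.Linear(c, len(self.branches))

    def forward(self, x):                                    # x: (B, C, T, H, W)
        cands = torch.stack([b(x) for b in self.branches])   # (K, B, C, T, H, W)
        w = self.gate(x.mean(dim=(2, 3, 4))).softmax(-1)     # per-sample branch weights
        return (w.t()[..., None, None, None, None] * cands).sum(0)

y = DiversifySelect()(torch.randn(2, 64, 8, 14, 14))
print(y.shape)  # torch.Size([2, 64, 8, 14, 14])
```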
no code implementations • 12 Jul 2022 • Mingye Xu, Yali Wang, Yihao Liu, Tong He, Yu Qiao
Inspired by prompting approaches from NLP, we creatively reinterpret point cloud generation and refinement as the prompting and predicting stages, respectively.
1 code implementation • CVPR 2022 • Mengzhe He, Yali Wang, Jiaxi Wu, Yiru Wang, Hanqing Li, Bo Li, Weihao Gan, Wei Wu, Yu Qiao
It can adaptively enhance source detector to perceive objects in a target image, by leveraging target proposal contexts from iterative cross-attention.
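A minimal sketch of this iterative cross-attention, under the assumption of generic RoI-token features and off-the-shelf multi-head attention:

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

src_feats = torch.randn(1, 100, dim)      # source detector features (e.g., RoI tokens)
tgt_props = torch.randn(1, 30, dim)       # proposal features from a target image

x = src_feats
for _ in range(3):                        # iterative cross-attention refinement
    ctx, _ = attn(query=x, key=tgt_props, value=tgt_props)
    x = x + ctx                           # residual update toward target context
print(x.shape)                            # torch.Size([1, 100, 256])
```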
no code implementations • CVPR 2022 • Jiaxi Wu, Jiaxin Chen, Mengzhe He, Yiru Wang, Bo Li, Bingqi Ma, Weihao Gan, Wei Wu, Yali Wang, Di Huang
Specifically, TRKP adopts the teacher-student framework, where a multi-head teacher network is built to extract knowledge from labeled source domains and guide the student network to learn detectors for the unlabeled target domain.
no code implementations • CVPR 2022 • Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, Yu Qiao
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition.
7 code implementations • 24 Jan 2022 • Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
Different from typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity in shallow and deep layers respectively, allowing the network to tackle both redundancy and dependency for efficient and effective representation learning.
Ranked #154 on Image Classification on ImageNet
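A compact, simplified sketch of the two aggregator types (not the official UniFormer code): shallow blocks mix tokens with a local depthwise convolution, deep blocks with global self-attention, inside an otherwise identical block.

```python
import torch
import torch.nn as nn

class RelationBlock(nn.Module):
    def __init__(self, dim=64, local=True):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        if local:   # shallow layers: local token affinity via depthwise conv
            self.aggregate = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        else:       # deep layers: global token affinity via self-attention
            self.aggregate = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.local = local
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                  # x: (B, N, dim)
        h = self.norm(x)
        if self.local:
            h = self.aggregate(h.transpose(1, 2)).transpose(1, 2)
        else:
            h, _ = self.aggregate(h, h, h)
        x = x + h
        return x + self.mlp(self.norm(x))

tokens = torch.randn(2, 49, 64)
deep = RelationBlock(local=False)(RelationBlock(local=True)(tokens))
print(deep.shape)  # torch.Size([2, 49, 64])
```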
no code implementations • 20 Jan 2022 • Mingye Xu, Yali Wang, Zhipeng Zhou, Hongbin Xu, Yu Qiao
To fill this gap, we propose a generic Contour-Perturbed Reconstruction Network (CP-Net), which can effectively guide self-supervised reconstruction to learn semantic content in the point cloud, and thus promote discriminative power of point cloud representation.
2 code implementations • 12 Jan 2022 • Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.9% and 71.2% top-1 accuracy, respectively.
2 code implementations • 22 Dec 2021 • Liang Pan, Tong Wu, Zhongang Cai, Ziwei Liu, Xumin Yu, Yongming Rao, Jiwen Lu, Jie Zhou, Mingye Xu, Xiaoyuan Luo, Kexue Fu, Peng Gao, Manning Wang, Yali Wang, Yu Qiao, Junsheng Zhou, Xin Wen, Peng Xiang, Yu-Shen Liu, Zhizhong Han, Yuanjie Yan, Junyi An, Lifa Zhu, Changwei Lin, Dongrui Liu, Xin Li, Francisco Gómez-Fernández, Qinlong Wang, Yang Yang
Based on the MVP dataset, this paper reports methods and results in the Multi-View Partial Point Cloud Challenge 2021 on Completion and Registration.
1 code implementation • 24 Nov 2021 • Zhuofan Zong, Kunchang Li, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs by dynamic token aggregation.
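A guessed-at minimal version of dynamic token aggregation in this spirit (the real TSM differs in detail): N input tokens are merged into M << N output tokens through a learned soft-assignment matrix, shrinking the sequence the rest of the ViT has to process.

```python
import torch
import torch.nn as nn

class TokenSlim(nn.Module):
    def __init__(self, dim=192, n_out=16):
        super().__init__()
        self.assign = nn.Linear(dim, n_out)   # scores each token for each output slot

    def forward(self, x):                     # x: (B, N, dim)
        w = self.assign(x).softmax(dim=1)     # (B, N, n_out), normalized over tokens
        return w.transpose(1, 2) @ x          # (B, n_out, dim): weighted token merges

x = torch.randn(2, 196, 192)                  # a ViT-style token sequence
print(TokenSlim()(x).shape)                   # torch.Size([2, 16, 192])
```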
2 code implementations • 24 Nov 2021 • David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou
With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance.
Ranked #38 on Action Recognition on Something-Something V2 (using extra training data)
3 code implementations • ICLR 2022 • Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.8% and 71.4% top-1 accuracy, respectively.
Ranked #8 on Action Recognition on Something-Something V1
no code implementations • 15 Sep 2021 • Junhao Zhang, Yali Wang, Zhipeng Zhou, Tianyu Luan, Zhe Wang, Yu Qiao
Graph Convolution Network (GCN) has been successfully used for 3D human pose estimation in videos.
Ranked #11 on 3D Human Pose Estimation on HumanEva-I
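To make the GCN ingredient concrete, here is a generic graph-convolution layer over a skeleton (a sketch, not the paper's network): joint features are mixed along the bones of a fixed adjacency matrix and lifted from 2D to 3D.

```python
import torch
import torch.nn as nn

class SkeletonGCN(nn.Module):
    def __init__(self, adj, c_in=2, c_out=3):
        super().__init__()
        deg = adj.sum(-1, keepdim=True)
        self.register_buffer("A", adj / deg)        # row-normalized adjacency
        self.lin = nn.Linear(c_in, c_out)

    def forward(self, x):                           # x: (B, joints, c_in) 2D poses
        return self.lin(self.A @ x)                 # propagate along bones, then project

# toy 4-joint chain 0-1-2-3, with self-loops
adj = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]], dtype=torch.float)
pose2d = torch.randn(8, 4, 2)
print(SkeletonGCN(adj)(pose2d).shape)               # torch.Size([8, 4, 3]) lifted to 3D
```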
1 code implementation • ICCV 2021 • Hongbin Xu, Zhipeng Zhou, Yali Wang, Wenxiong Kang, Baigui Sun, Hao Li, Yu Qiao
Specifically, the limitations can be categorized into two types: ambiguous supervision in the foreground and invalid supervision in the background.
no code implementations • 27 Aug 2021 • Fanxin Xu, Xiangkui Li, Hang Yang, Yali Wang, Wei Xiang
In this work, an object detector based on YOLOF has been proposed to detect blood cell objects such as red blood cells, white blood cells and platelets.
1 code implementation • ICLR 2021 • Kunchang Li, Xianhang Li, Yali Wang, Jun Wang, Yu Qiao
It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, to improve the cooperative power of all the feature dimensions in our CT-Module.
Ranked #18 on Action Recognition on Something-Something V1
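A loose sketch of jointly gating the spatial, temporal, and channel dimensions; this squeeze-and-excite style illustration is an assumption for intuition, not the CT-Module itself.

```python
import torch
import torch.nn as nn

class STCAttention(nn.Module):
    def __init__(self, c=64, t=8):
        super().__init__()
        self.chan = nn.Linear(c, c)
        self.temp = nn.Linear(t, t)

    def forward(self, x):                                         # x: (B, C, T, H, W)
        gc = torch.sigmoid(self.chan(x.mean(dim=(2, 3, 4))))      # channel gate (B, C)
        gt = torch.sigmoid(self.temp(x.mean(dim=(1, 3, 4))))      # temporal gate (B, T)
        gs = torch.sigmoid(x.mean(dim=(1, 2), keepdim=True))      # spatial gate (B,1,1,H,W)
        return x * gc[:, :, None, None, None] * gt[:, None, :, None, None] * gs

y = STCAttention()(torch.randn(2, 64, 8, 14, 14))
print(y.shape)  # torch.Size([2, 64, 8, 14, 14])
```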
1 code implementation • 24 May 2021 • Yi Liu, Limin Wang, Yali Wang, Xiao Ma, Yu Qiao
Temporal action localization (TAL) is an important and challenging problem in video understanding.
no code implementations • 16 Mar 2021 • Tianyu Luan, Yali Wang, Junhao Zhang, Zhe Wang, Zhipeng Zhou, Yu Qiao
By coupling advanced 3D pose estimators and HMR in a serial or parallel manner, these two frameworks can effectively correct the human mesh with the guidance of a concise pose calibration module.
Ranked #4 on 3D Human Pose Estimation on Surreal
no code implementations • 23 Jan 2021 • Yali Wang, Steffen Limmer, Markus Olhofer, Michael Emmerich, Thomas Baeck
A preference based multi-objective evolutionary algorithm is proposed for generating solutions in an automatically detected knee point region.
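For intuition, a standard distance-to-extreme-line heuristic for locating a knee region on a 2-D front is sketched below; the paper's detection scheme may differ.

```python
import numpy as np

# a synthetic convex Pareto front over two objectives
f1 = np.linspace(0.0, 1.0, 50)
front = np.stack([f1, 1.0 / (f1 + 0.1)], axis=1)

p, q = front[0], front[-1]                      # extreme points of the front
v = q - p
# perpendicular distance of every front point to the line through p and q
d = np.abs(v[0] * (front[:, 1] - p[1]) - v[1] * (front[:, 0] - p[0])) / np.linalg.norm(v)
knee = front[d.argmax()]
region = front[d > 0.8 * d.max()]               # a simple "knee region" around it
print(knee, region.shape)
```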
1 code implementation • CVPR 2020 • Xianhang Li, Yali Wang, Zhipeng Zhou, Yu Qiao
Our SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency.
no code implementations • 15 Apr 2020 • Yali Wang, André Deutz, Thomas Bäck, Michael Emmerich
Given a point in $m$-dimensional objective space, any $\varepsilon$-ball around that point can be partitioned into the incomparable, the dominated, and the dominating regions.
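This partition is easy to verify numerically. The sketch below samples an epsilon-ball around a reference vector and classifies each sample (minimization assumed; epsilon and the number of objectives are illustrative):

```python
import numpy as np

def classify(ref, pts):
    le, ge = (pts <= ref).all(1), (pts >= ref).all(1)
    return np.where(le & ~ge, "dominating",          # better in every objective
           np.where(ge & ~le, "dominated", "incomparable"))

rng = np.random.default_rng(0)
ref = np.zeros(3)                                    # m = 3 objectives
ball = rng.normal(size=(1000, 3))
ball = 0.1 * ball / np.linalg.norm(ball, axis=1, keepdims=True)  # epsilon-ball samples
for name, count in zip(*np.unique(classify(ref, ball), return_counts=True)):
    print(name, count)
```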
no code implementations • 14 Apr 2020 • Yali Wang, Bas van Stein, Michael T. M. Emmerich, Thomas Bäck
A customized multi-objective evolutionary algorithm (MOEA) is proposed for the multi-objective flexible job shop scheduling problem (FJSP).
1 code implementation • 16 Mar 2020 • Ze Yang, Yali Wang, Xianyu Chen, Jianzhuang Liu, Yu Qiao
Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors.
1 code implementation • 24 Feb 2020 • Peiqin Zhuang, Yali Wang, Yu Qiao
These distinct gate vectors inherit mutual context on semantic differences, which allow API-Net to attentively capture contrastive clues by pairwise interaction between two images.
Ranked #13 on Fine-Grained Image Classification on Stanford Cars
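A simplified sketch of such attentive pairwise interaction (the API-Net details are richer): two image embeddings build a mutual vector, sigmoid gates highlight their contrastive channels, and each feature is enhanced by both its own and its partner's gate.

```python
import torch
import torch.nn as nn

dim = 128
mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

x1, x2 = torch.randn(4, dim), torch.randn(4, dim)   # features of an image pair
mutual = mlp(torch.cat([x1, x2], dim=-1))           # what the two images share

g1 = torch.sigmoid(mutual * x1)                     # gate: where x1 stands out in context
g2 = torch.sigmoid(mutual * x2)

x1_self, x1_other = x1 + x1 * g1, x1 + x1 * g2      # four attentive variants per pair
x2_self, x2_other = x2 + x2 * g2, x2 + x2 * g1
print(x1_self.shape)                                # torch.Size([4, 128])
```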
no code implementations • 12 Feb 2020 • Hao Chen, Yali Wang, Guoyou Wang, Xiang Bai, Yu Qiao
Inspired by this procedure of learning to detect, we propose a novel Progressive Object Transfer Detection (POTD) framework.
no code implementations • 9 Feb 2019 • Chu Qin, Ying Tan, Shang Ying Chen, Xian Zeng, Xingxing Qi, Tian Jin, Huan Shi, Yiwei Wan, Yu Chen, Jingfeng Li, Weidong He, Yali Wang, Peng Zhang, Feng Zhu, Hongping Zhao, Yuyang Jiang, Yuzong Chen
We explored the superior learning capability of deep autoencoders for unsupervised clustering of 1.39 million bioactive molecules into band-clusters in a 3-dimensional latent chemical space.
no code implementations • CVPR 2018 • Yali Wang, Lei Zhou, Yu Qiao
To mimic this capacity, we propose a novel Hybrid Video Memory (HVM) machine, which can hallucinate temporal features of still images from video memory, in order to boost action recognition with few still images.
1 code implementation • 5 Mar 2018 • Hao Chen, Yali Wang, Guoyou Wang, Yu Qiao
Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge respectively from source and target domains, in order to further enhance fine-tuning with a few target images.
Ranked #22 on Few-Shot Object Detection on MS-COCO (30-shot)
1 code implementation • 2017 IEEE International Conference on Computer Vision (ICCV) 2017 • Wenbin Du, Yali Wang, Yu Qiao
Firstly, unlike previous works on pose-related action recognition, our RPAN is an end-to-end recurrent network which can exploit important spatial-temporal evolutions of human pose to assist action recognition in a unified framework.
Ranked #5 on Skeleton Based Action Recognition on J-HMDB
1 code implementation • 1 Sep 2016 • Zhe Wang, Limin Wang, Yali Wang, Bowen Zhang, Yu Qiao
In this paper, we propose a hybrid representation, which leverages the discriminative capacity of CNNs and the simplicity of descriptor encoding schema for image recognition, with a focus on scene recognition.
no code implementations • NeurIPS 2012 • Yali Wang, Brahim Chaib-Draa
We present a novel marginalized particle Gaussian process (MPGP) regression, which provides a fast, accurate online Bayesian filtering framework to model the latent function.
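For reference, here is a tiny exact Gaussian-process regression posterior, which is only the GP building block; the paper's marginalized particle filtering around it is not shown.

```python
import numpy as np

def gp_predict(X, y, Xs, ls=0.5, noise=1e-2):
    # squared-exponential kernel between two sets of 1-D inputs
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks, Kss = k(X, Xs), k(Xs, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)                 # posterior mean
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)           # posterior covariance
    return mean, cov

X = np.linspace(0, 1, 10)
y = np.sin(6 * X) + 0.05 * np.random.randn(10)
mean, cov = gp_predict(X, y, np.linspace(0, 1, 50))
print(mean.shape, np.diag(cov).max())   # posterior mean and predictive variance
```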