1 code implementation • 29 Apr 2024 • Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou
By presenting a granular classification and landscape of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advances in the field.
1 code implementation • 26 Mar 2024 • Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
Correspondingly, a single text embedding may not be expressive enough to capture the video embedding and empower retrieval.
1 code implementation • 20 Nov 2023 • Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, Nicu Sebe
Transformers have been successfully applied in the field of video-based 3D human pose estimation.
no code implementations • 19 Oct 2023 • Lijuan Zhou, Xiang Meng, Zhihuan Liu, Mengqi Wu, Zhimin Gao, Pichao Wang
This paper presents a comprehensive survey of pose-based applications utilizing deep learning, encompassing pose estimation, pose tracking, and action recognition. Pose estimation involves the determination of human joint positions from images or image sequences.
2 code implementations • 15 Sep 2023 • Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou
Experiments on 19 visual transfer learning downstream tasks demonstrate that our SCT outperforms full fine-tuning on 18 out of 19 tasks by adding only 0.11M parameters of the ViT-B, which is 780× fewer than its full fine-tuning counterpart.
1 code implementation • 23 Aug 2023 • Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang
RGB-D action and gesture recognition remains an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation of human motion.
no code implementations • ICCV 2023 • Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou
Therefore, we propose path pruning and EnsembleScale techniques for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and let the short paths focus on providing high-quality representations for subsequent paths.
no code implementations • ICCV 2023 • Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar
Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment.
Ranked #10 on Video Retrieval on MSR-VTT
no code implementations • 1 Apr 2023 • Shuning Chang, Pichao Wang, Fan Wang, Jiashi Feng, Mike Zheng Shou
Specifically, one branch focuses on detection representation for actor detection, and the other one for action recognition.
2 code implementations • CVPR 2023 • Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen
However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection.
no code implementations • CVPR 2023 • Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, Raffay Hamid
To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens, resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos.
Ranked #2 on Video Classification on Breakfast
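The selective-token idea can be sketched minimally: a learned scoring head ranks tokens and only the top fraction is passed on to the sequence model. This is a loose NumPy illustration under assumed shapes, not the S5 model's actual mask generator.

```python
import numpy as np

def select_tokens(tokens, w_score, keep_ratio=0.5):
    """Score each token with a lightweight linear head and keep only the
    top-scoring fraction -- a rough sketch of mask-based token selection."""
    scores = tokens @ w_score                    # one scalar score per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[::-1][:k]          # highest-scoring token indices
    return tokens[np.sort(keep)]                 # preserve original temporal order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))               # 16 image tokens, dim 64
w = rng.normal(size=64)
kept = select_tokens(tokens, w, keep_ratio=0.25)
print(kept.shape)                                # (4, 64)
```

Downstream, only the kept tokens would be fed to the long-range sequence model, which is where the efficiency gain comes from.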
2 code implementations • 22 Mar 2023 • Hansheng Chen, Wei Tian, Pichao Wang, Fan Wang, Lu Xiong, Hao Li
In this paper, we propose the EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold.
Ranked #4 on 6D Pose Estimation using RGB on LineMOD
1 code implementation • CVPR 2023 • Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou
In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks.
no code implementations • 14 Mar 2023 • Hengyuan Zhao, Hao Luo, Yuyang Zhao, Pichao Wang, Fan Wang, Mike Zheng Shou
In view of the practicality of PETL, previous works focus on tuning a small set of parameters for each downstream task in an end-to-end manner while rarely considering the task distribution shift issue between the pre-training task and the downstream task.
1 code implementation • 11 Jan 2023 • Bo Dong, Pichao Wang, Fan Wang
On the ADE20K dataset, our model achieves 41.8 mIoU at 4.6 GFLOPs, which is 4.4 mIoU higher than SegFormer with 45% fewer GFLOPs.
1 code implementation • 16 Nov 2022 • Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang, Fan Wang
Although improving motion recognition to some extent, these methods still face sub-optimal situations in the following aspects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure brings more challenges to spatiotemporal information modeling; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion.
Ranked #3 on Action Recognition on NTU RGB+D
1 code implementation • NeurIPS 2022 • Zhenyu Wang, Hao Luo, Pichao Wang, Feng Ding, Fan Wang, Hao Li
Although Vision transformers (ViTs) have recently dominated many vision tasks, deploying ViT models on resource-limited devices remains a challenging problem.
no code implementations • 6 Oct 2022 • Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu, Wanqing Li
Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and the short-range temporal dynamics.
no code implementations • 29 Sep 2022 • Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang, Fan Wang
To achieve these two purposes, we propose a novel data-centric ViT training framework to dynamically measure the "difficulty" of training samples and generate "effective" samples for models at different training stages.
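A difficulty-aware sampling schedule of this general kind can be sketched in a few lines: per-sample loss stands in for "difficulty", and the sampling distribution shifts from easy to hard as training progresses. The linear schedule and softmax weighting below are illustrative choices, not the paper's actual measure.

```python
import numpy as np

def sampling_probs(losses, progress):
    """Turn per-sample losses (a stand-in 'difficulty' signal) into sampling
    probabilities: favour easy samples early (progress ~ 0) and hard samples
    late (progress ~ 1)."""
    difficulty = (losses - losses.min()) / (np.ptp(losses) + 1e-8)
    logits = (2.0 * progress - 1.0) * difficulty   # negative slope early, positive late
    e = np.exp(logits - logits.max())
    return e / e.sum()

losses = np.array([0.1, 0.5, 1.0, 3.0, 5.0])
early = sampling_probs(losses, progress=0.0)   # most mass on the easiest sample
late = sampling_probs(losses, progress=1.0)    # most mass on the hardest sample
print(early.argmax(), late.argmax())           # 0 4
```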
1 code implementation • 21 Sep 2022 • Zihui Guo, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li
It has been studied either using first person vision (FPV) or third person vision (TPV).
1 code implementation • CVPR 2022 • Hansheng Chen, Pichao Wang, Fan Wang, Wei Tian, Lu Xiong, Hao Li
The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distribution.
Ranked #6 on 6D Pose Estimation using RGB on LineMOD
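The KL-based training signal can be illustrated with a toy discrete version: EPro-PnP works with a continuous density on the SE(3) manifold, but a small set of hypothetical pose candidates keeps the sketch short, and the scores and target below are made up for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over pose hypotheses."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

scores = np.array([0.2, 2.5, 0.1, -1.0])          # hypothetical pose scores
pred = np.exp(scores) / np.exp(scores).sum()      # predicted pose distribution
target = np.array([0.0, 1.0, 0.0, 0.0])           # all mass on the true pose
loss = kl_divergence(target, pred)                # reduces to -log pred[true]
print(round(loss, 4))
```

With a delta-like target, minimizing this KL is equivalent to maximizing the predicted probability of the ground-truth pose, which is what makes the layer trainable end to end.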
no code implementations • 19 Feb 2022 • Shanshan Wang, Lei Zhang, Pichao Wang
In our work, considering the different importance of pair-wise samples for both feature learning and domain alignment, we deduce our BP-Triplet loss for effective UDA from the perspective of Bayesian learning.
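The notion of giving pair-wise samples different importance can be sketched as a triplet loss with per-pair weights that grow for harder pairs. The sigmoid weighting below is an illustrative stand-in; the paper derives its weights from a Bayesian treatment of the pairs.

```python
import numpy as np

def weighted_triplet_loss(anchor, pos, neg, margin=0.3):
    """Triplet loss whose positive/negative terms are re-weighted by how
    poorly each pair currently satisfies the constraint (harder pairs
    receive larger weights)."""
    d_pos = np.linalg.norm(anchor - pos, axis=1)
    d_neg = np.linalg.norm(anchor - neg, axis=1)
    w_pos = 1.0 / (1.0 + np.exp(-(d_pos - d_pos.mean())))  # far positives = harder
    w_neg = 1.0 / (1.0 + np.exp(d_neg - d_neg.mean()))     # near negatives = harder
    return float(np.mean(np.maximum(w_pos * d_pos - w_neg * d_neg + margin, 0.0)))

rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(8, 32)) for _ in range(3))
loss = weighted_triplet_loss(a, p, n)
print(loss >= 0.0)   # True
```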
no code implementations • 21 Jan 2022 • Pichao Wang, Fan Wang, Hao Li
During the KD process, the TCL loss transfers the local structure, exploits the higher order information, and mitigates the misalignment of the heterogeneous output of teacher and student networks.
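Transferring local structure while tolerating heterogeneous teacher/student outputs can be sketched with relation-based distillation: match the pairwise similarity matrices of the two feature batches rather than the raw features. This is an illustration in the spirit of such a loss, not the exact TCL formulation.

```python
import numpy as np

def local_structure_kd_loss(teacher_feats, student_feats):
    """Match pairwise cosine-similarity matrices of teacher and student
    batches, so neighbourhood structure is transferred; the similarity
    matrices have the same batch-by-batch shape even when the feature
    dimensions of the two networks differ."""
    def cos_sim(F):
        Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
        return Fn @ Fn.T
    return float(np.mean((cos_sim(teacher_feats) - cos_sim(student_feats)) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 128))   # teacher embeddings, dim 128
student = rng.normal(size=(8, 64))    # student embeddings, dim 64 -- mismatched
loss = local_structure_kd_loss(teacher, student)
print(loss >= 0.0)   # works despite the dimension mismatch
```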
1 code implementation • 23 Dec 2021 • Jingkai Zhou, Pichao Wang, Fan Wang, Qiong Liu, Hao Li, Rong Jin
Self-attention is powerful in modeling long-range dependencies, but it is weak in local finer-level feature learning.
Ranked #46 on Semantic Segmentation on ADE20K val
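The global/local trade-off described above is often addressed by pairing attention with a cheap local operator. The sketch below adds a depthwise 1-D convolution branch to a plain single-head attention (projection weights omitted for brevity); it illustrates the general pattern, not the paper's specific module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    """Plain single-head self-attention over tokens (no learned projections)."""
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))
    return attn @ x

def local_branch(x, kernel):
    """Depthwise 1-D convolution along the token axis: a cheap local
    operator that recovers finer-level features attention tends to miss."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[i:i + len(kernel)].T @ kernel for i in range(len(x))])

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
out = global_attention(tokens) + local_branch(tokens, np.array([0.25, 0.5, 0.25]))
print(out.shape)   # (8, 16)
```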
1 code implementation • CVPR 2022 • Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang, Fan Wang, Du Zhang, Zhen Lei, Hao Li, Rong Jin
Decoupling spatiotemporal representation refers to decomposing the spatial and temporal features into dimension-independent factors.
Ranked #1 on Hand Gesture Recognition on NVGesture
1 code implementation • 2 Dec 2021 • Zhaoyuan Yin, Pichao Wang, Fan Wang, Xianzhe Xu, Hanling Zhang, Hao Li, Rong Jin
Unsupervised semantic segmentation aims to obtain high-level semantic representation on low-level visual features without manual annotations.
Ranked #2 on Unsupervised Semantic Segmentation on COCO-Stuff-171 (using extra training data)
1 code implementation • CVPR 2022 • Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, Luc van Gool
Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion.
Ranked #22 on 3D Human Pose Estimation on MPI-INF-3DHP
2 code implementations • 23 Nov 2021 • Hao Luo, Pichao Wang, Yi Xu, Feng Ding, Yanxin Zhou, Fan Wang, Hao Li, Rong Jin
We first investigate self-supervised learning (SSL) methods with Vision Transformer (ViT) pretrained on unlabelled person images (the LUPerson dataset), and empirically find it significantly surpasses ImageNet supervised pre-training models on ReID tasks.
Ranked #1 on Unsupervised Person Re-Identification on Market-1501 (using extra training data)
2 code implementations • ICLR 2022 • Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, Rong Jin
Along with the pseudo labels, a weight-sharing triple-branch transformer framework is proposed to apply self-attention and cross-attention for source/target feature learning and source-target domain alignment, respectively.
Ranked #3 on Domain Adaptation on Office-31
no code implementations • 8 Sep 2021 • Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, Rong Jin
In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help stable training; rather, the scaled ReLU operation in the convolutional stem (conv-stem) is what matters.
1 code implementation • 28 May 2021 • Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Hao Li, Rong Jin
A key component in vision transformers is the fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies.
no code implementations • 15 Apr 2021 • Zitong Yu, Xiaobai Li, Pichao Wang, Guoying Zhao
3D mask face presentation attack detection (PAD) plays a vital role in securing face recognition systems from emergent 3D mask attacks.
no code implementations • 30 Mar 2021 • Shuning Chang, Pichao Wang, Fan Wang, Hao Li, Jiashi Feng
Temporal action proposal generation (TAPG) is a fundamental and challenging task in video understanding, especially in temporal action detection.
1 code implementation • 26 Mar 2021 • Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, Wenming Yang
The modified VTE is termed the Strided Transformer Encoder (STE), which is built upon the outputs of VTE.
Ranked #2 on 3D Human Pose Estimation on HumanEva-I
4 code implementations • ICCV 2021 • Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
Extracting robust feature representation is one of the key challenges in object re-identification (ReID).
Ranked #1 on Person Re-Identification on Market-1501-C
2 code implementations • 1 Feb 2021 • Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, Rong Jin
Compared with previous NAS methods, the proposed Zen-NAS is orders of magnitude faster on multiple server-side and mobile-side GPU platforms, with state-of-the-art accuracy on ImageNet.
Ranked #2 on Neural Architecture Search on ImageNet
no code implementations • 5 Jan 2021 • Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li
In this paper, we propose a Transformer-based RGB-D egocentric action recognition framework, called Trear.
2 code implementations • ICCV 2021 • Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, Rong Jin
To address this issue, instead of using an accuracy predictor, we propose a novel zero-shot index dubbed Zen-Score to rank the architectures.
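A zero-cost proxy of this general family can be sketched quickly: score a randomly initialised network by how strongly it amplifies a small input perturbation, then rank candidate architectures by that score. This is a generic illustration in the spirit of such proxies, not the exact Zen-Score formula.

```python
import numpy as np

def zero_shot_score(widths, n_probe=8, eps=1e-2, seed=0):
    """Score a random-weight ReLU MLP (layer widths given) by the log ratio of
    output perturbation to input perturbation -- no training, no labels."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=np.sqrt(2.0 / widths[i]), size=(widths[i], widths[i + 1]))
          for i in range(len(widths) - 1)]

    def forward(x):
        for W in Ws[:-1]:
            x = np.maximum(x @ W, 0.0)   # ReLU hidden layers
        return x @ Ws[-1]

    x = rng.normal(size=(n_probe, widths[0]))
    dx = eps * rng.normal(size=x.shape)
    delta = forward(x + dx) - forward(x)
    return float(np.log(np.linalg.norm(delta) / np.linalg.norm(dx)))

score = zero_shot_score([32, 64, 64, 10])   # one candidate architecture
print(score)
```

Because no gradient steps are taken, thousands of candidate architectures can be ranked this way in the time a single training run would take.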
no code implementations • 8 Dec 2020 • Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li
In this paper, we propose a method consisting of two camera pose estimators that deal with the information from pairwise images and a short sequence of images respectively.
no code implementations • 29 Oct 2020 • Haoyuan Zhang, Yonghong Hou, Pichao Wang, Zihui Guo, Wanqing Li
The recently developed DARTS (Differentiable Architecture Search) is adopted to search for an effective network architecture that is built upon the two types of cells.
1 code implementation • 21 Aug 2020 • Zitong Yu, Benjia Zhou, Jun Wan, Pichao Wang, Haoyu Chen, Xin Liu, Stan Z. Li, Guoying Zhao
Gesture recognition has attracted considerable attention owing to its great potential in applications.
no code implementations • 21 Feb 2020 • Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, Huan Xu
It is deployed as a public online service and widely adopted in different business scenarios at Alibaba Group.
no code implementations • 17 Mar 2018 • Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, Philip Ogunbona
This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition.
no code implementations • 5 Dec 2017 • Pichao Wang, Wanqing Li, Jun Wan, Philip Ogunbona, Xinwang Liu
Unlike the conventional ConvNet, which learns deep separable features for homogeneous modality-based classification with only one softmax loss function, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities.
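Jointly optimizing a ranking loss and a softmax loss can be sketched as a single combined objective: cross-entropy for classification plus a triplet-style term over features from the two modalities. The weighting `lam` and the feature/logit values below are illustrative, not the c-ConvNet's actual configuration.

```python
import numpy as np

def softmax_ce(logits, label):
    """Standard softmax cross-entropy for a single sample."""
    z = logits - logits.max()
    return float(-z[label] + np.log(np.exp(z).sum()))

def joint_loss(feat_a, feat_match, feat_mismatch, logits, label,
               margin=0.5, lam=0.5):
    """Softmax classification loss plus a ranking term that pulls matched
    cross-modal features together and pushes a mismatched pair apart."""
    rank = max(0.0, margin + np.linalg.norm(feat_a - feat_match)
                         - np.linalg.norm(feat_a - feat_mismatch))
    return softmax_ce(logits, label) + lam * rank

rng = np.random.default_rng(0)
rgb, depth, wrong = (rng.normal(size=64) for _ in range(3))
loss = joint_loss(rgb, depth, wrong, logits=np.array([1.0, 0.2, -0.5]), label=0)
print(loss > 0.0)   # True
```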
no code implementations • 31 Oct 2017 • Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, Sergio Escalera
Specifically, deep learning methods based on the CNN and RNN architectures have been adopted for motion recognition using RGB-D data.
no code implementations • 6 Jul 2017 • Chuankun Li, Pichao Wang, Shuang Wang, Yonghong Hou, Wanqing Li
Recent methods based on 3D skeleton data have achieved outstanding performance due to the conciseness, robustness, and view-independent representation of such data.
1 code implementation • 2 May 2017 • Zewei Ding, Pichao Wang, Philip O. Ogunbona, Wanqing Li
The proposed method achieved state-of-the-art performance on NTU RGB+D dataset for 3D human action analysis.
Ranked #105 on Skeleton Based Action Recognition on NTU RGB+D (Accuracy (CV) metric)
no code implementations • CVPR 2017 • Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, Philip Ogunbona
Based on the scene flow vectors, we propose a new representation, namely, Scene Flow to Action Map (SFAM), that describes several long term spatio-temporal dynamics for action recognition.
Ranked #3 on Hand Gesture Recognition on ChaLearn val
no code implementations • 7 Jan 2017 • Pichao Wang, Wanqing Li, Song Liu, Zhimin Gao, Chang Tang, Philip Ogunbona
This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI).
Ranked #2 on Hand Gesture Recognition on ChaLearn val
no code implementations • 30 Dec 2016 • Pichao Wang, Wanqing Li, Chuankun Li, Yonghong Hou
Convolutional Neural Networks (ConvNets) have recently shown promising performance in many computer vision tasks, especially image-based recognition.
Ranked #1 on Skeleton Based Action Recognition on Gaming 3D (G3D)
no code implementations • 8 Nov 2016 • Pichao Wang, Zhaoyang Li, Yonghong Hou, Wanqing Li
Recently, Convolutional Neural Networks (ConvNets) have shown promising performances in many computer vision tasks, especially image-based recognition.
no code implementations • 22 Aug 2016 • Pichao Wang, Wanqing Li, Song Liu, Yuyao Zhang, Zhimin Gao, Philip Ogunbona
This paper addresses the problem of continuous gesture recognition from sequences of depth maps using convolutional neural networks (ConvNets).
no code implementations • 1 Feb 2016 • Pichao Wang, Zhaoyang Li, Yonghong Hou, Wanqing Li
This paper proposes a new framework for RGB-D-based action recognition that takes advantage of hand-designed features from skeleton data and deeply learned features from depth maps, and effectively exploits both the local and global temporal information.
no code implementations • 21 Jan 2016 • Jing Zhang, Wanqing Li, Philip O. Ogunbona, Pichao Wang, Chang Tang
Human action recognition from RGB-D (Red, Green, Blue and Depth) data has attracted increasing attention since the first work reported in 2010.
no code implementations • 10 Nov 2015 • Chang Tang, Pichao Wang, Wanqing Li
This paper presents a fast yet effective method to recognize actions from a stream of noisy skeleton data, and a novel weighted covariance descriptor is adopted to accumulate evidence.
no code implementations • IEEE Transactions on Human-Machine Systems 2016 • Pichao Wang, Wanqing Li, Zhimin Gao, Jing Zhang, Chang Tang, Philip Ogunbona
In addition, the method was evaluated on the large dataset constructed from the above datasets.
Ranked #9 on Multimodal Activity Recognition on EV-Action
no code implementations • 20 Jan 2015 • Pichao Wang, Wanqing Li, Zhimin Gao, Jing Zhang, Chang Tang, Philip Ogunbona
The results show that our approach can achieve state-of-the-art results on the individual datasets without dramatic performance degradation on the Combined Dataset.
no code implementations • 14 Sep 2014 • Pichao Wang, Wanqing Li, Philip Ogunbona, Zhimin Gao, Hanling Zhang
These parts are referred to as Frequent Local Parts or FLPs.