no code implementations • 28 May 2024 • Shentong Mo, Sukmin Yun
To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets.
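The two steps (a) and (b) above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the top-k neighbor count, and the single-head softmax attention are all assumptions for clarity.

```python
import numpy as np

def select_neighbors(patch_feats, idx, k=4):
    """Step (a), sketched: rank all other patches by cosine similarity to the
    masked patch at `idx` and keep the top-k as semantically related neighbors."""
    q = patch_feats[idx]
    sims = patch_feats @ q / (np.linalg.norm(patch_feats, axis=1)
                              * np.linalg.norm(q) + 1e-8)
    sims[idx] = -np.inf                      # exclude the patch itself
    return np.argsort(sims)[::-1][:k]

def aggregate_target(patch_feats, idx, k=4):
    """Step (b), sketched: one softmax cross-attention step with the masked
    patch as query and the selected neighbors as keys/values."""
    nbrs = select_neighbors(patch_feats, idx, k)
    q, kv = patch_feats[idx], patch_feats[nbrs]
    attn = np.exp(kv @ q / np.sqrt(q.size))
    attn /= attn.sum()
    return attn @ kv                         # aggregated masked target

feats = np.random.default_rng(0).normal(size=(16, 8))  # 16 patches, dim 8
target = aggregate_target(feats, idx=5)
print(target.shape)
```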
no code implementations • 24 May 2024 • Shentong Mo, Yapeng Tian
Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images.
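The quadratic scaling mentioned above comes from forming the n-by-n attention score matrix; a back-of-the-envelope cost function (illustrative only) makes the growth concrete:

```python
def attention_score_cost(n_tokens, dim):
    """Multiply-adds to form the n x n attention score matrix (Q @ K^T),
    the term that dominates self-attention cost as inputs grow."""
    return n_tokens * n_tokens * dim

# Doubling the token count (e.g. from a higher-resolution image)
# quadruples the attention cost.
for n in (256, 512, 1024):
    print(n, attention_score_cost(n, dim=64))
```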
no code implementations • 12 May 2024 • Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way.
no code implementations • 19 Apr 2024 • Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li
Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens.
1 code implementation • 31 Mar 2024 • Jiantao Wu, Shentong Mo, Sara Atito, ZhenHua Feng, Josef Kittler, Muhammad Awais
Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data.
no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian
Extensive evaluations on the AudioCaps and T2AV-Bench benchmarks demonstrate that our T2AV sets a new standard for video-aligned TTA generation, ensuring both visual alignment and temporal consistency.
no code implementations • 8 Mar 2024 • Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado
We hope our established benchmark can open new avenues for controllable visual generation.
no code implementations • 27 Feb 2024 • Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li
Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
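The core VPT mechanism can be sketched with plain arrays: a small set of learnable prompt tokens is concatenated with the (frozen) patch embeddings before the transformer blocks, and only the prompts are updated during adaptation. The shapes and prompt count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(196, 768))    # frozen ViT patch embeddings
prompts = rng.normal(size=(10, 768)) * 0.02   # learnable prompt tokens

# VPT-style input: prompts are prepended to the patch tokens; during
# adaptation, gradients flow only into `prompts` while the backbone stays frozen.
x = np.concatenate([prompts, patch_tokens], axis=0)
print(x.shape)
```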
no code implementations • 22 Feb 2024 • Miao Xin, Zhongrui You, Zihan Zhang, Taoran Jiang, Tingjia Xu, Haotian Liang, Guojing Ge, Yuchen Ji, Shentong Mo, Jian Cheng
We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions.
no code implementations • 12 Dec 2023 • Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs.
no code implementations • 2 Dec 2023 • Shuxian Zou, Hui Li, Shentong Mo, Xingyi Cheng, Eric Xing, Le Song
Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs.
1 code implementation • 2 Dec 2023 • Shentong Mo, Pedro Morgado
Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions.
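The motivation for factorizing can be seen by counting interactions: scoring every local audio feature against every local visual feature is a large dense product, while summarizing each modality first shrinks the interaction to compact representations. The pooling choice below (a simple mean) is an illustrative stand-in, not the paper's factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(size=(64, 128))     # local audio features (e.g. time bins)
video = rng.normal(size=(196, 128))    # local visual features (spatial grid)

# Dense approach: score every audio-visual location pair -> 64 * 196 entries.
dense = audio @ video.T

# Factorized sketch: summarize each modality's local features first, then
# model the interaction between the compact summaries.
a, v = audio.mean(axis=0), video.mean(axis=0)
score = float(a @ v)                   # one interaction instead of 12,544
print(dense.shape, dense.size)
```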
no code implementations • 2 Dec 2023 • Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, ZhenHua Feng, Muhammad Awais
Recently, self-supervised metric learning has attracted attention for its potential to learn a generic distance function.
no code implementations • 10 Nov 2023 • Shentong Mo, Paul Pu Liang, Russ Salakhutdinov, Louis-Philippe Morency
The Internet of Things (IoT), the network of billions of smart physical devices embedded with sensors, software, and communication technologies for connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of the modern world.
no code implementations • 28 Oct 2023 • Shentong Mo, Zhun Sun, Chao Li
Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views.
no code implementations • 14 Sep 2023 • Shentong Mo, Miao Xin
These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process.
1 code implementation • ICCV 2023 • Shentong Mo, Weiguo Pian, Yapeng Tian
Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features.
no code implementations • 22 Aug 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, ZhenHua Feng, Josef Kittler
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.
1 code implementation • ICCV 2023 • Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.
1 code implementation • NeurIPS 2023 • Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li
Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images.
Ranked #1 on Point Cloud Generation on ShapeNet Car
1 code implementation • 30 May 2023 • Shentong Mo, Pedro Morgado
The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task.
no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian
In this work, we propose DiffAVA, a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, which simply fine-tunes lightweight visual-text alignment modules with frozen modality-specific encoders to update visually aligned text embeddings as the condition.
no code implementations • 3 May 2023 • Shentong Mo, Yapeng Tian
In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio.
no code implementations • 10 Apr 2023 • Shentong Mo, Jingfei Xia, Ihor Markevych
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.
1 code implementation • CVPR 2023 • Shentong Mo, Yapeng Tian
Sound source localization is a typical and challenging task that predicts the location of sound sources in a video.
1 code implementation • 22 Mar 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Xingshen Zhang, Lin Wang, Xiang Yang
One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity.
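The trade-off described above is commonly exposed through a weighting coefficient on the KL term, as in the well-known beta-VAE objective; the sketch below is a generic illustration of that trade-off, not this paper's method.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Reconstruction term plus a beta-weighted KL term: larger beta pushes
    the latent code toward the disentangling prior at the cost of
    reconstruction fidelity."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
    return recon + beta * kl

x, x_hat = np.zeros(4), np.full(4, 0.1)
mu, logvar = np.full(2, 0.5), np.zeros(2)
# Same reconstruction error, but the heavier KL weight raises the total loss,
# illustrating the pressure that degrades reconstruction as beta grows.
print(beta_vae_loss(x, x_hat, mu, logvar, beta=1.0),
      beta_vae_loss(x, x_hat, mu, logvar, beta=4.0))
```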
no code implementations • 18 Oct 2022 • Shentong Mo, Zhun Sun, Chao Li
Particularly, in classification downstream tasks with linear probes, our proposed method outperforms the state-of-the-art instance-wise and prototypical contrastive learning methods by 2.96% on the ImageNet-100 dataset and by 2.46% on the ImageNet-1K dataset under the same batch size and number of epochs.
1 code implementation • 30 Aug 2022 • Shentong Mo, Pedro Morgado
We also propose a new approach for visual sound source localization that addresses both these problems.
no code implementations • 18 Aug 2022 • Shentong Mo, Zhun Sun, Chao Li
One drawback of contrastive self-supervised learning (CSL) is that the loss term requires a large number of negative samples to provide a tight mutual-information bound.
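The dependence on negatives can be seen in the standard InfoNCE loss, where the positive pair is contrasted against K negatives and the corresponding mutual-information bound tightens as K grows. This is a generic textbook sketch, not the paper's specific loss; the temperature and sample counts are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE: -log softmax probability of the positive pair among
    1 + K candidates; the associated MI bound scales like log(1 + K)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=8)
p = a + 0.1 * rng.normal(size=8)              # augmented "view" of the anchor
negs = rng.normal(size=(1024, 8))             # CSL typically needs many of these
print(info_nce(a, p, negs))
```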
no code implementations • 28 May 2022 • Jiantao Wu, Shentong Mo
Furthermore, we investigate the inter-object and intra-object relationship and find that the latter is crucial for self-supervised pre-training.
no code implementations • 1 Apr 2022 • Fangyi Chen, Han Zhang, Zaiwang Li, Jiachen Dou, Shentong Mo, Hao Chen, Yongxin Zhang, Uzair Ahmed, Chenchen Zhu, Marios Savvides
To make full use of computer vision technology in stores, the actual needs and characteristics of the retail scene must be taken into account.
Ranked #1 on Dense Object Detection on SKU-110K
no code implementations • 20 Mar 2022 • Shentong Mo, Jingfei Xia, Xiaoqing Tan, Bhiksha Raj
Our Point3D consists of a Point Head for action localization and a 3D Head for action classification.
1 code implementation • 17 Mar 2022 • Shentong Mo, Pedro Morgado
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.
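A common building block for this task is an audio-visual similarity map: the clip-level audio embedding is compared against each spatial visual feature, and the peak of the resulting heatmap is taken as the sounding region. The sketch below illustrates that generic mechanism with made-up shapes; it is not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=128)          # clip-level audio embedding
vis_feats = rng.normal(size=(7, 7, 128))  # spatial visual feature map

# Localization heatmap: cosine similarity of the audio embedding with the
# visual feature at every spatial location.
flat = vis_feats.reshape(-1, 128)
sims = flat @ audio_emb / (np.linalg.norm(flat, axis=1)
                           * np.linalg.norm(audio_emb) + 1e-8)
heatmap = sims.reshape(7, 7)
peak = np.unravel_index(heatmap.argmax(), heatmap.shape)  # predicted region
print(heatmap.shape, peak)
```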
no code implementations • 8 Mar 2022 • Shentong Mo, Daizong Liu, Wei Hu
Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames.
1 code implementation • 2 Mar 2022 • Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov
Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots.
6 code implementations • 7 Feb 2022 • Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang
The pretraining tasks include two tasks: masked representation prediction - predict the representations for the masked patches, and masked patch reconstruction - reconstruct the masked patches.
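The two pretraining tasks combine naturally into one objective over the masked patches: an alignment loss on predicted representations and a reconstruction loss on pixels. The sketch below assumes a simple unweighted sum with MSE for both terms, which is an illustrative simplification rather than the paper's exact losses.

```python
import numpy as np

def pretrain_loss(pred_repr, target_repr, recon_patch, true_patch):
    """Sum of the two masked-patch objectives: predict the target
    representations of masked patches, and reconstruct their pixels."""
    align = np.mean((pred_repr - target_repr) ** 2)   # masked representation prediction
    recon = np.mean((recon_patch - true_patch) ** 2)  # masked patch reconstruction
    return align + recon

rng = np.random.default_rng(0)
loss = pretrain_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)),
                     rng.normal(size=(4, 48)), rng.normal(size=(4, 48)))
print(loss)
```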
no code implementations • 11 Oct 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Zhiqiang Shen, Eric P Xing, Yanyan Lan
The core problem is to model how regulatory elements interact with each other and how these interactions vary across different cell types.
no code implementations • 29 Sep 2021 • Jingwei Liu, Yi Gu, Shentong Mo, Zhun Sun, Shumin Han, Jiafeng Guo, Xueqi Cheng
In self-supervised learning frameworks, deep networks are optimized to align different views of an instance that contain similar visual semantic information.
no code implementations • 29 Sep 2021 • Shentong Mo, Zhun Sun, Shumin Han
Recent works apply contrastive learning to the discriminator of Generative Adversarial Networks, but little work has explored whether contrastive learning can be applied to encoders to learn disentangled representations.
no code implementations • NeurIPS Workshop AI4Scien 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Yanyan Lan, Zhiqiang Shen, Eric Xing
In this work, we propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
no code implementations • 22 Sep 2021 • Shentong Mo, Pengtao Xie
Learning by examples, in which one solves a new problem by examining how similar problems are solved, is an effective method in human learning.
1 code implementation • 15 Dec 2020 • Shentong Mo, Xiaoqing Tan, Jingfei Xia, Pinxu Ren
Spatiotemporal action recognition deals with locating and classifying actions in videos.
1 code implementation • 15 Dec 2020 • Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi
Automatic speaker verification (ASV) is the technology to determine the identity of a person based on their voice.