no code implementations • 28 May 2024 • Shentong Mo, Sukmin Yun
To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets.
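The two steps (a) and (b) above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the top-k neighbor count, and the single-head softmax attention are all assumptions for clarity.

```python
import numpy as np

def select_neighbors(patch_feats, idx, k=4):
    """Step (a), sketched: rank all other patches by cosine similarity to the
    masked patch at `idx` and keep the top-k as semantically related neighbors."""
    q = patch_feats[idx]
    sims = patch_feats @ q / (np.linalg.norm(patch_feats, axis=1)
                              * np.linalg.norm(q) + 1e-8)
    sims[idx] = -np.inf                      # exclude the patch itself
    return np.argsort(sims)[::-1][:k]

def aggregate_target(patch_feats, idx, k=4):
    """Step (b), sketched: one softmax cross-attention step with the masked
    patch as query and the selected neighbors as keys/values."""
    nbrs = select_neighbors(patch_feats, idx, k)
    q, kv = patch_feats[idx], patch_feats[nbrs]
    attn = np.exp(kv @ q / np.sqrt(q.size))
    attn /= attn.sum()
    return attn @ kv                         # aggregated masked target

feats = np.random.default_rng(0).normal(size=(16, 8))  # 16 patches, dim 8
target = aggregate_target(feats, idx=5)
print(target.shape)
```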
no code implementations • 24 May 2024 • Shentong Mo, Yapeng Tian
Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images.
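The quadratic scaling mentioned above comes from forming the n-by-n attention score matrix; a back-of-the-envelope cost function (illustrative only) makes the growth concrete:

```python
def attention_score_cost(n_tokens, dim):
    """Multiply-adds to form the n x n attention score matrix (Q @ K^T),
    the term that dominates self-attention cost as inputs grow."""
    return n_tokens * n_tokens * dim

# Doubling the token count (e.g. from a higher-resolution image)
# quadruples the attention cost.
for n in (256, 512, 1024):
    print(n, attention_score_cost(n, dim=64))
```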
no code implementations • 12 May 2024 • Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way.
no code implementations • 19 Apr 2024 • Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li
Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens.
1 code implementation • 31 Mar 2024 • Jiantao Wu, Shentong Mo, Sara Atito, ZhenHua Feng, Josef Kittler, Muhammad Awais
Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data.
no code implementations • 8 Mar 2024 • Shentong Mo, Jing Shi, Yapeng Tian
Extensive evaluations on the AudioCaps and T2AV-Bench benchmarks demonstrate that our T2AV sets a new standard for video-aligned TTA generation, ensuring both visual alignment and temporal consistency.
no code implementations • 8 Mar 2024 • Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado
We hope our established benchmark can open new avenues for controllable visual generation.
no code implementations • 27 Feb 2024 • Shentong Mo, Yansen Wang, Xufang Luo, Dongsheng Li
Visual Prompt Tuning (VPT) techniques have gained prominence for their capacity to adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
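The core VPT mechanism can be sketched with plain arrays: a small set of learnable prompt tokens is concatenated with the (frozen) patch embeddings before the transformer blocks, and only the prompts are updated during adaptation. The shapes and prompt count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(196, 768))    # frozen ViT patch embeddings
prompts = rng.normal(size=(10, 768)) * 0.02   # learnable prompt tokens

# VPT-style input: prompts are prepended to the patch tokens; during
# adaptation, gradients flow only into `prompts` while the backbone stays frozen.
x = np.concatenate([prompts, patch_tokens], axis=0)
print(x.shape)
```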
no code implementations • 22 Feb 2024 • Miao Xin, Zhongrui You, Zihan Zhang, Taoran Jiang, Tingjia Xu, Haotian Liang, Guojing Ge, Yuchen Ji, Shentong Mo, Jian Cheng
We present SpaceAgents-1, a system for learning human and multi-robot collaboration (HMRC) strategies under microgravity conditions.
no code implementations • 12 Dec 2023 • Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs.
no code implementations • 2 Dec 2023 • Shuxian Zou, Hui Li, Shentong Mo, Xingyi Cheng, Eric Xing, Le Song
Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs.
1 code implementation • 2 Dec 2023 • Shentong Mo, Pedro Morgado
Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions.
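The motivation for factorizing can be seen by counting interactions: scoring every local audio feature against every local visual feature is a large dense product, while summarizing each modality first shrinks the interaction to compact representations. The pooling choice below (a simple mean) is an illustrative stand-in, not the paper's factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(size=(64, 128))     # local audio features (e.g. time bins)
video = rng.normal(size=(196, 128))    # local visual features (spatial grid)

# Dense approach: score every audio-visual location pair -> 64 * 196 entries.
dense = audio @ video.T

# Factorized sketch: summarize each modality's local features first, then
# model the interaction between the compact summaries.
a, v = audio.mean(axis=0), video.mean(axis=0)
score = float(a @ v)                   # one interaction instead of 12,544
print(dense.shape, dense.size)
```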
no code implementations • 2 Dec 2023 • Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, ZhenHua Feng, Muhammad Awais
Recently, self-supervised metric learning has attracted attention for its potential to learn a generic distance function.
no code implementations • 10 Nov 2023 • Shentong Mo, Paul Pu Liang, Russ Salakhutdinov, Louis-Philippe Morency
The Internet of Things (IoT), the network of billions of smart physical devices embedded with sensors, software, and communication technologies for connecting and exchanging data with other devices and systems, is a critical and rapidly expanding component of the modern world.
no code implementations • 28 Oct 2023 • Shentong Mo, Zhun Sun, Chao Li
Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views.
no code implementations • 14 Sep 2023 • Shentong Mo, Miao Xin
These local uncertainties, intrinsic to LLMs given their potential for diverse responses, remain a significant concern in the reasoning process.
1 code implementation • ICCV 2023 • Shentong Mo, Weiguo Pian, Yapeng Tian
Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features.
no code implementations • 22 Aug 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, ZhenHua Feng, Josef Kittler
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.
1 code implementation • ICCV 2023 • Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as the number of incremental steps grows.
1 code implementation • NeurIPS 2023 • Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li
Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images.
Ranked #1 on Point Cloud Generation on ShapeNet Car
1 code implementation • 30 May 2023 • Shentong Mo, Pedro Morgado
The ability to accurately recognize, localize and separate sound sources is fundamental to any audio-visual perception task.
no code implementations • 22 May 2023 • Shentong Mo, Jing Shi, Yapeng Tian
In this work, we propose DiffAVA, a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, which simply fine-tunes lightweight visual-text alignment modules with frozen modality-specific encoders to update visually aligned text embeddings as the condition.
no code implementations • 3 May 2023 • Shentong Mo, Yapeng Tian
In this work, we propose a simple yet effective audio-visual localization and segmentation framework based on the Segment Anything Model, namely AV-SAM, that can generate sounding object masks corresponding to the audio.
no code implementations • 10 Apr 2023 • Shentong Mo, Jingfei Xia, Ihor Markevych
Visual and linguistic pre-training aims to learn vision and language representations together, which can be transferred to visual-linguistic downstream tasks.
1 code implementation • CVPR 2023 • Shentong Mo, Yapeng Tian
Sound source localization is a typical and challenging task that predicts the location of sound sources in a video.
1 code implementation • 22 Mar 2023 • Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Xingshen Zhang, Lin Wang, Xiang Yang
One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity.
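The trade-off described above is commonly exposed through a weighting coefficient on the KL term, as in the well-known beta-VAE objective; the sketch below is a generic illustration of that trade-off, not this paper's method.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Reconstruction term plus a beta-weighted KL term: larger beta pushes
    the latent code toward the disentangling prior at the cost of
    reconstruction fidelity."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
    return recon + beta * kl

x, x_hat = np.zeros(4), np.full(4, 0.1)
mu, logvar = np.full(2, 0.5), np.zeros(2)
# Same reconstruction error, but the heavier KL weight raises the total loss,
# illustrating the pressure that degrades reconstruction as beta grows.
print(beta_vae_loss(x, x_hat, mu, logvar, beta=1.0),
      beta_vae_loss(x, x_hat, mu, logvar, beta=4.0))
```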
no code implementations • 18 Oct 2022 • Shentong Mo, Zhun Sun, Chao Li
Particularly, in classification downstream tasks with linear probes, our proposed method outperforms the state-of-the-art instance-wise and prototypical contrastive learning methods by 2.96% on the ImageNet-100 dataset and by 2.46% on the ImageNet-1K dataset under the same batch size and number of epochs.
1 code implementation • 30 Aug 2022 • Shentong Mo, Pedro Morgado
We also propose a new approach for visual sound source localization that addresses both these problems.
no code implementations • 18 Aug 2022 • Shentong Mo, Zhun Sun, Chao Li
One drawback of contrastive self-supervised learning (CSL) is that the loss term requires a large number of negative samples to provide a tight mutual-information bound.
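The dependence on negatives can be seen in the standard InfoNCE loss, where the positive pair is contrasted against K negatives and the corresponding mutual-information bound tightens as K grows. This is a generic textbook sketch, not the paper's specific loss; the temperature and sample counts are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE: -log softmax probability of the positive pair among
    1 + K candidates; the associated MI bound scales like log(1 + K)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=8)
p = a + 0.1 * rng.normal(size=8)              # augmented "view" of the anchor
negs = rng.normal(size=(1024, 8))             # CSL typically needs many of these
print(info_nce(a, p, negs))
```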
no code implementations • 28 May 2022 • Jiantao Wu, Shentong Mo
Furthermore, we investigate the inter-object and intra-object relationship and find that the latter is crucial for self-supervised pre-training.
no code implementations • 1 Apr 2022 • Fangyi Chen, Han Zhang, Zaiwang Li, Jiachen Dou, Shentong Mo, Hao Chen, Yongxin Zhang, Uzair Ahmed, Chenchen Zhu, Marios Savvides
To make full use of computer vision technology in stores, the actual needs and characteristics of the retail scene must be taken into account.
Ranked #1 on Dense Object Detection on SKU-110K
no code implementations • 20 Mar 2022 • Shentong Mo, Jingfei Xia, Xiaoqing Tan, Bhiksha Raj
Our Point3D consists of a Point Head for action localization and a 3D Head for action classification.
1 code implementation • 17 Mar 2022 • Shentong Mo, Pedro Morgado
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training.
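A common building block for this task is an audio-visual similarity map: the clip-level audio embedding is compared against each spatial visual feature, and the peak of the resulting heatmap is taken as the sounding region. The sketch below illustrates that generic mechanism with made-up shapes; it is not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=128)          # clip-level audio embedding
vis_feats = rng.normal(size=(7, 7, 128))  # spatial visual feature map

# Localization heatmap: cosine similarity of the audio embedding with the
# visual feature at every spatial location.
flat = vis_feats.reshape(-1, 128)
sims = flat @ audio_emb / (np.linalg.norm(flat, axis=1)
                           * np.linalg.norm(audio_emb) + 1e-8)
heatmap = sims.reshape(7, 7)
peak = np.unravel_index(heatmap.argmax(), heatmap.shape)  # predicted region
print(heatmap.shape, peak)
```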
no code implementations • 8 Mar 2022 • Shentong Mo, Daizong Liu, Wei Hu
Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames.
1 code implementation • 2 Mar 2022 • Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov
Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots.
6 code implementations • 7 Feb 2022 • Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang
The pretraining tasks include two tasks: masked representation prediction - predict the representations for the masked patches, and masked patch reconstruction - reconstruct the masked patches.
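The two pretraining tasks combine naturally into one objective over the masked patches: an alignment loss on predicted representations and a reconstruction loss on pixels. The sketch below assumes a simple unweighted sum with MSE for both terms, which is an illustrative simplification rather than the paper's exact losses.

```python
import numpy as np

def pretrain_loss(pred_repr, target_repr, recon_patch, true_patch):
    """Sum of the two masked-patch objectives: predict the target
    representations of masked patches, and reconstruct their pixels."""
    align = np.mean((pred_repr - target_repr) ** 2)   # masked representation prediction
    recon = np.mean((recon_patch - true_patch) ** 2)  # masked patch reconstruction
    return align + recon

rng = np.random.default_rng(0)
loss = pretrain_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)),
                     rng.normal(size=(4, 48)), rng.normal(size=(4, 48)))
print(loss)
```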
no code implementations • 11 Oct 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Zhiqiang Shen, Eric P Xing, Yanyan Lan
The core problem is to model how regulatory elements interact with each other and how these interactions vary across different cell types.
no code implementations • 29 Sep 2021 • Jingwei Liu, Yi Gu, Shentong Mo, Zhun Sun, Shumin Han, Jiafeng Guo, Xueqi Cheng
In self-supervised learning frameworks, deep networks are optimized to align different views of an instance that contain similar visual semantic information.
no code implementations • 29 Sep 2021 • Shentong Mo, Zhun Sun, Shumin Han
Recent works apply contrastive learning to the discriminator of Generative Adversarial Networks, but little work has explored whether contrastive learning can be applied to encoders to learn disentangled representations.
no code implementations • NeurIPS Workshop AI4Scien 2021 • Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Yanyan Lan, Zhiqiang Shen, Eric Xing
In this work, we propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
no code implementations • 22 Sep 2021 • Shentong Mo, Pengtao Xie
Learning by examples, in which one solves a new problem by examining how similar problems are solved, is an effective method in human learning.
1 code implementation • 15 Dec 2020 • Shentong Mo, Xiaoqing Tan, Jingfei Xia, Pinxu Ren
Spatiotemporal action recognition deals with locating and classifying actions in videos.
1 code implementation • 15 Dec 2020 • Shentong Mo, Haofan Wang, Pinxu Ren, Ta-Chung Chi
Automatic speaker verification (ASV) is the technology to determine the identity of a person based on their voice.