1 code implementation • 12 Jul 2022 • Jiashuo Yu, Jinyu Liu, Ying Cheng, Rui Feng, Yuejie Zhang
In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning.
Anomaly Detection In Surveillance Videos • audio-visual learning
1 code implementation • 12 Jul 2022 • Xinyu Huang, Youcai Zhang, Ying Cheng, Weiwei Tian, RuiWei Zhao, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, Xiaobo Zhang
However, the image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP.
no code implementations • 7 Jul 2022 • Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan
Although audio-visual representation has been proved to be applicable in many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory contents, remains challenging and uninvestigated.
no code implementations • 10 Apr 2022 • Jinyu Liu, Ying Cheng, Yuejie Zhang, Rui-Wei Zhao, Rui Feng
Visual-only self-supervised learning has achieved significant improvement in video representation learning.
1 code implementation • 24 Nov 2021 • Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
Recognizing and localizing events in videos is a fundamental task for video understanding.
no code implementations • 1 Sep 2021 • Xi Long, Ying Cheng, Xiao Mu, Lian Liu, Jingxin Liu
We present a summary of the domain adaptive cascade R-CNN method for mitosis detection of digital histopathology images.
no code implementations • 7 Apr 2021 • Jiashuo Yu, Ying Cheng, Rui Feng
The localization subnetwork consists of the Multimodal Bottleneck Attention Module (MBAM), which is designed to extract fine-grained segment-level contents.
no code implementations • 13 Aug 2020 • Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, Yuejie Zhang
When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice of lip motion, the music of playing instruments.
no code implementations • COLING 2020 • Ruize Wang, Zhongyu Wei, Ying Cheng, Piji Li, Haijun Shan, Ji Zhang, Qi Zhang, Xuanjing Huang
Visual storytelling aims to generate a narrative paragraph from a sequence of images automatically.
Ranked #9 on Visual Storytelling on VIST