no code implementations • 11 May 2024 • Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Quan Z. Sheng, Qingming Huang
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored.
no code implementations • 1 Apr 2024 • Tao Hu, Qingsen Yan, Yuankai Qi, Yanning Zhang
To address this challenge, we propose the Low-Frequency aware Diffusion (LF-Diff) model for ghost-free HDR imaging.
1 code implementation • 12 Mar 2024 • Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, BoWen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, Johan W. Verjans
Medical vision language pre-training (VLP) has emerged as a frontier of research, enabling zero-shot pathological recognition by comparing the query image with the textual descriptions for each disease.
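A minimal sketch of the generic zero-shot recognition step described above (comparing a query image embedding against text embeddings of per-disease descriptions); the encoders and embedding size here are placeholders, not the paper's actual model:

```python
# CLIP-style zero-shot classification: pick the disease whose textual description
# embedding is most similar (cosine) to the query image embedding.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_feat: torch.Tensor, text_feats: torch.Tensor) -> int:
    """image_feat: (D,) embedding of the query image.
    text_feats: (C, D) embeddings of one textual description per disease."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    sims = text_feats @ image_feat          # (C,) cosine similarities
    return int(sims.argmax())               # index of the best-matching description

# toy usage with random embeddings standing in for real encoders
img = torch.randn(512)
txt = torch.randn(5, 512)                   # 5 hypothetical disease descriptions
print(zero_shot_classify(img, txt))
```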
no code implementations • 20 Feb 2024 • Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton Van Den Hengel, Ming-Hsuan Yang, Chenggang Yan, Qingming Huang
It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync.
no code implementations • 20 Dec 2023 • Yunchuan Ma, Chang Teng, Yuankai Qi, Guorong Li, Laiyun Qing, Qi Wu, Qingming Huang
To address this problem, we propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.
1 code implementation • 10 Dec 2023 • Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton Van Den Hengel, Ming-Hsuan Yang, Qingming Huang
To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
1 code implementation • 4 Dec 2023 • Chen Zhang, Guorong Li, Yuankai Qi, Hanhua Ye, Laiyun Qing, Ming-Hsuan Yang, Qingming Huang
To address these limitations, we propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection, which learns multi-scale temporal features.
1 code implementation • ICCV 2023 • Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, Qi Wu
Nevertheless, this poses more challenges than other VLN tasks since it requires agents to infer a navigation plan only based on a short instruction.
1 code implementation • ICCV 2023 • Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, Qi Wu
Navigating in the sky is more complicated than on the ground because agents need to consider flying height and perform more complex spatial relationship reasoning.
no code implementations • 7 Aug 2023 • Chongyang Zhao, Yuankai Qi, Qi Wu
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction.
1 code implementation • 1 Jun 2023 • Shengqin Jiang, Yaoyu Fang, Haokui Zhang, Qingshan Liu, Yuankai Qi, Yang Yang, Peng Wang
Rehearsal-based video incremental learning often employs knowledge distillation to mitigate catastrophic forgetting of previously learned data.
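A minimal sketch of the generic knowledge-distillation term used in rehearsal-based incremental learning (the hyperparameters T and alpha are illustrative, not the paper's values): on replayed samples the new model is pushed to match the frozen old model's softened predictions, alongside the usual cross-entropy on new data.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, T: float = 2.0):
    """KL divergence between temperature-softened old and new predictions."""
    p_old = F.softmax(old_logits / T, dim=-1)
    log_p_new = F.log_softmax(new_logits / T, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * T * T

def total_loss(new_logits_new_data, labels, new_logits_replay, old_logits_replay,
               alpha: float = 0.5):
    ce = F.cross_entropy(new_logits_new_data, labels)             # learn the new task
    kd = distillation_loss(new_logits_replay, old_logits_replay)  # preserve old knowledge
    return ce + alpha * kd

# toy usage with random logits
new_logits = torch.randn(4, 10)
old_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(total_loss(new_logits, labels, new_logits, old_logits))
```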
no code implementations • IEEE Transactions on Pattern Analysis and Machine Intelligence 2023 • Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
To address these problems, we present a history-enhanced and order-aware pre-training with the complementing fine-tuning paradigm (HOP+) for VLN.
1 code implementation • 29 Dec 2022 • Shengqin Jiang, Qing Wang, Fengna Cheng, Yuankai Qi, Qingshan Liu
In this paper, we build the first evolving object counting dataset and propose a unified object counting network as the first attempt to address this task.
no code implementations • CVPR 2023 • Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, Ming-Hsuan Yang
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
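One common weakly supervised formulation of this setting (sketched here for illustration, not necessarily this paper's exact objective) treats each video as a bag of snippets, pools the top-k snippet anomaly scores into a video-level score, and supervises it with the video-level normal/abnormal label only:

```python
import torch
import torch.nn.functional as F

def video_level_loss(snippet_scores: torch.Tensor, video_label: torch.Tensor, k: int = 3):
    """snippet_scores: (T,) anomaly scores in [0, 1] for T snippets of one video.
    video_label: scalar tensor, 1.0 for an abnormal video, 0.0 for a normal one."""
    topk = snippet_scores.topk(min(k, snippet_scores.numel())).values
    video_score = topk.mean()                      # bag-level score from top-k snippets
    return F.binary_cross_entropy(video_score, video_label)

loss = video_level_loss(torch.rand(32), torch.tensor(1.0))
```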
1 code implementation • ICCV 2023 • Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, Jing Shao
Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Ranked #2 on Visual Navigation on R2R
1 code implementation • 8 Dec 2022 • Ziheng Yan, Yuankai Qi, Guorong Li, Xinyan Liu, Weigang Zhang, Qingming Huang, Ming-Hsuan Yang
Crowd counting is usually handled in a density map regression fashion, which is supervised via an L2 loss between the predicted density map and the ground truth.
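A minimal sketch of that standard density-map supervision (shapes are illustrative): the network regresses a per-pixel density map, the L2/MSE loss compares it to a ground-truth map, and the estimated count is the map's integral.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

pred_density = torch.rand(1, 1, 96, 128)   # predicted density map (B, 1, H, W)
gt_density = torch.rand(1, 1, 96, 128)     # ground-truth density map

loss = mse(pred_density, gt_density)       # pixel-wise L2 supervision
count = pred_density.sum()                 # estimated crowd count = integral of the map
```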
no code implementations • 8 Dec 2022 • Xinyan Liu, Guorong Li, Yuankai Qi, Zhenjun Han, Qingming Huang, Ming-Hsuan Yang, Nicu Sebe
Crowd localization aims to predict the spatial position of humans in a crowd scenario.
1 code implementation • CVPR 2023 • Gaoxiang Cong, Liang Li, Yuankai Qi, ZhengJun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang
Given a piece of text, a video clip and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
1 code implementation • 26 Jul 2022 • Weidong Chen, Dexiang Hong, Yuankai Qi, Zhenjun Han, Shuhui Wang, Laiyun Qing, Qingming Huang, Guorong Li
To address this problem, we propose a multi-attention network consisting of a dual-path dual-attention module and a query-based cross-modal Transformer module.
Ranked #5 on Referring Expression Segmentation on A2D Sentences
1 code implementation • CVPR 2022 • Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, Qi Wu
Pre-training has been adopted in a few recent works for Vision-and-Language Navigation (VLN).
Ranked #4 on Visual Navigation on R2R
no code implementations • CVPR 2022 • Qi Chen, Yuanqing Li, Yuankai Qi, Jiaqiu Zhou, Mingkui Tan, Qi Wu
Existing Voice Cloning (VC) tasks aim to convert a paragraph of text to speech with a desired voice specified by a reference audio.
1 code implementation • CVPR 2022 • Hanhua Ye, Guorong Li, Yuankai Qi, Shuhui Wang, Qingming Huang, Ming-Hsuan Yang
(II) Predicate level, which learns the actions conditioned on highlighted objects and is supervised by the predicate in captions.
1 code implementation • 15 Jul 2021 • Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, Tieniu Tan
Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views.
Ranked #82 on Vision and Language Navigation on VLN Challenge
no code implementations • CVPR 2021 • Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould
In this paper we propose a recurrent BERT model that is time-aware for use in VLN.
1 code implementation • ICCV 2021 • Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton Van Den Hengel, Qi Wu
Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
1 code implementation • NAACL 2022 • Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, William Yang Wang
Results show that indoor navigation agents refer to both object and direction tokens when making decisions.
1 code implementation • 26 Nov 2020 • Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould
In this paper we propose a recurrent BERT model that is time-aware for use in VLN.
Ranked #7 on Visual Navigation on R2R
1 code implementation • NeurIPS 2020 • Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould
From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment.
no code implementations • ECCV 2020 • Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton Van Den Hengel, Qi Wu
The first is object descriptions (e.g., 'table', 'door'), each serving as a cue for the agent to determine the next action by finding the item in the environment; the second is action specifications (e.g., 'go straight', 'turn left'), which allow the robot to directly predict the next movement without relying on visual perception.
no code implementations • 18 Mar 2020 • Xinjie Feng, Hongxun Yao, Yuankai Qi, Jun Zhang, Shengping Zhang
Different from previous transformer-based models [56, 34], which use only the decoder of the transformer to decode the convolutional attention, the proposed method uses convolutional feature maps as word embeddings input to the transformer.
1 code implementation • CVPR 2020 • Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, Anton Van Den Hengel
One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language.
no code implementations • ECCV 2018 • Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, Qi Tian
Selected from 10 hours of raw videos, about 80,000 representative frames are fully annotated with bounding boxes as well as up to 14 kinds of attributes (e.g., weather condition, flying altitude, camera view, vehicle category, and occlusion) for three fundamental computer vision tasks: object detection, single object tracking, and multiple object tracking.
Ranked #5 on Object Detection on UAVDT
3 code implementations • 1 Aug 2017 • Xiaoxiao Li, Yuankai Qi, Zhe Wang, Kai Chen, Ziwei Liu, Jianping Shi, Ping Luo, Xiaoou Tang, Chen Change Loy
Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module.
no code implementations • CVPR 2016 • Yuankai Qi, Shengping Zhang, Lei Qin, Hongxun Yao, Qingming Huang, Jongwoo Lim, Ming-Hsuan Yang
In recent years, several methods have been developed to utilize hierarchical features learned from a deep convolutional neural network (CNN) for visual tracking.
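A minimal sketch of extracting such hierarchical CNN features (an untrained torchvision VGG16 is used purely for illustration, and the layer indices are assumptions rather than any particular tracker's choice); trackers of this kind typically correlate these shallow-to-deep feature maps with a target template.

```python
import torch
import torchvision

vgg = torchvision.models.vgg16(weights=None).features.eval()
layers_of_interest = {8, 15, 22}            # illustrative intermediate-layer indices

x = torch.randn(1, 3, 224, 224)             # a search-region crop
feats = []
with torch.no_grad():
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers_of_interest:
            feats.append(x)                 # keep shallow-to-deep feature maps

print([f.shape for f in feats])
```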