no code implementations • CCL 2022 • Yaqiang Wang, Xiao Yang, Xuechao Hao, Hongping Shu, Guo Chen, Tao Zhu
Accurate postoperative risk prediction plays a positive role in clinical resource planning and contingency preparation, and in reducing patients' postoperative risk and mortality. Current postoperative risk prediction relies mainly on structured preoperative and intraoperative data such as basic patient information, laboratory tests, and vital signs, while the value of unstructured preoperative diagnoses, which carry rich semantic information, remains to be verified. To address this problem, this paper proposes a postoperative risk prediction model enhanced with unstructured-data representations, which uses a self-attention mechanism to elegantly fuse structured data with preoperative diagnosis text through weighted information fusion. On clinical data, the proposed method is compared with statistical machine learning models commonly used for postoperative risk prediction and with recent deep neural networks; it not only improves prediction performance but also brings good interpretability to the prediction model.
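The fusion mechanism described above lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version of attention-weighted fusion between structured features and a preoperative-diagnosis text embedding; the module names, dimensions, and the use of a BERT-style 768-d text vector are our assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class FusionRiskModel(nn.Module):
    """Hypothetical sketch: fuse structured features with a preoperative-
    diagnosis text embedding via self-attention, then predict risk."""

    def __init__(self, n_struct: int, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.struct_proj = nn.Linear(n_struct, d_model)   # structured fields -> token
        self.text_proj = nn.Linear(768, d_model)          # e.g. a BERT [CLS] vector -> token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)                 # postoperative risk logit

    def forward(self, struct_x, text_emb):
        # Stack the two modalities as a 2-token sequence and let
        # self-attention learn the weighted fusion between them.
        tokens = torch.stack(
            [self.struct_proj(struct_x), self.text_proj(text_emb)], dim=1)
        fused, attn_w = self.attn(tokens, tokens, tokens)  # attn_w aids interpretability
        return self.head(fused.mean(dim=1)).squeeze(-1), attn_w
```

The returned attention weights are one route to the interpretability the abstract mentions: they show how much each prediction leaned on the structured fields versus the diagnosis text.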
1 code implementation • 2 Apr 2024 • Kai Li, Guo Chen
Notably, within computer vision, Mamba-based methods have been celebrated for their formidable performance and reduced computational requirements.
1 code implementation • 24 Mar 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos, we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on Zero-Shot Video Question Answer on MVBench
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #1 on Temporal Action Localization on FineAction
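As a flavor of how Mamba can serve as a temporal module in such a suite, here is a minimal sketch built on the public `mamba_ssm` package (CUDA required); the block structure and sizes are illustrative, not the actual Video Mamba Suite code.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires a CUDA GPU)

class TemporalMambaBlock(nn.Module):
    """Sketch of one role for Mamba: a temporal module over a sequence
    of per-frame features (names are ours, not the suite's)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2)

    def forward(self, x):                     # x: (batch, frames, dim)
        return x + self.mamba(self.norm(x))   # residual temporal modeling

feats = torch.randn(2, 64, 512).cuda()        # 64 frames of clip features
out = TemporalMambaBlock(512).cuda()(feats)   # (2, 64, 512)
```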
no code implementations • 1 Jan 2024 • Jilan Xu, Yifei HUANG, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos.
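A hedged sketch of the retrieval step in a retrieval-augmented captioner like the one described: score third-person (exocentric) video embeddings against the egocentric clip and keep the top-k. Function and variable names are ours, not EgoInstructor's.

```python
import torch
import torch.nn.functional as F

def retrieve_exo_videos(ego_feat, exo_feats, k: int = 3):
    """Hypothetical retrieval step: pick the k third-person instructional
    videos whose embeddings are most similar to the egocentric clip.
    ego_feat: (d,); exo_feats: (num_videos, d)."""
    sims = F.cosine_similarity(ego_feat.unsqueeze(0), exo_feats, dim=-1)
    topk = sims.topk(k).indices          # indices of relevant exo videos
    return topk, sims[topk]

# The retrieved exocentric features would then be cross-attended by the
# captioning decoder alongside the egocentric video tokens.
```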
2 code implementations • 21 Dec 2023 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)
no code implementations • 11 Dec 2023 • Jiawen Yi, Guo Chen
This framework decouples the Text-to-SQL task based on query hardness by analyzing questions and schemas, simplifying the multi-hardness task into a single-hardness challenge.
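To illustrate the decoupling idea, here is a toy router that sends a question/schema pair to a hardness-specific solver; the hardness criteria, the schema layout, and the solver names below are all assumed for illustration, not reproduced from the paper.

```python
def route_by_hardness(question: str, schema: dict) -> str:
    """Illustrative routing only: estimate query hardness from the
    question and schema, then dispatch to a single-hardness solver."""
    joins = sum(len(tbl["foreign_keys"]) for tbl in schema["tables"])
    nested = any(w in question.lower() for w in ("more than", "at least", "never"))
    if joins == 0 and not nested:
        return "easy_solver"      # single-table prompt template
    if joins <= 1:
        return "medium_solver"    # one-join prompt template
    return "hard_solver"          # decomposition + multi-join template
```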
2 code implementations • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Ranked #1 on Video Question Answering on IntentQA
1 code implementation • ICCV 2023 • Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
Ranked #1 on Action Detection on THUMOS'14
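The memory-anticipation pattern can be sketched as learned future queries attending over past-frame memory with a standard transformer decoder; this is a simplification of MAT, with the sizes, layer counts, and class count chosen arbitrarily.

```python
import torch
import torch.nn as nn

class MemoryAnticipationSketch(nn.Module):
    """Loose sketch: learned future (anticipation) queries attend over
    past (memory) tokens; MAT's circular memory-anticipation interaction
    is omitted for brevity."""

    def __init__(self, dim: int = 256, n_future: int = 8):
        super().__init__()
        self.future_queries = nn.Parameter(torch.randn(n_future, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls = nn.Linear(dim, 21)   # e.g. 20 actions + background

    def forward(self, memory):          # memory: (batch, past_frames, dim)
        q = self.future_queries.expand(memory.size(0), -1, -1)
        future = self.decoder(q, memory)   # anticipate upcoming segments
        return self.cls(future)            # per-query action logits
```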
1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, we utilize a multi-scale approach to generate video-related descriptions.
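One plausible reading of that multi-scale pipeline, with every callable a placeholder rather than the paper's actual models:

```python
def describe_video_multiscale(video, coarse_captioner, fine_captioner, llm_fuse):
    """Hypothetical pipeline: caption the whole clip once (coarse scale),
    caption sampled frames (fine scale), then fuse into one description.
    All three callables and `sample_frames` are placeholders."""
    coarse = coarse_captioner(video)                  # clip-level caption
    fine = [fine_captioner(f) for f in video.sample_frames(4)]
    return llm_fuse(coarse, fine)                     # merged description
```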
1 code implementation • 3 Jul 2023 • Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
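A simplified sketch of the query-based transformer decoding that such AVS frameworks typically use: audio-derived queries attend to visual tokens, and each query embedding is dotted with per-pixel features to form a mask. Dimensions, heads, and layer counts are illustrative, not AVSegFormer's.

```python
import torch
import torch.nn as nn

class AudioQueryDecoderSketch(nn.Module):
    """Audio-derived queries attend to visual features and yield mask
    embeddings (a generic pattern, simplified from the paper)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, audio_queries, visual_tokens, pixel_feat):
        # audio_queries: (B, Q, C); visual_tokens: (B, HW, C);
        # pixel_feat: (B, C, H, W)
        mask_emb = self.decoder(audio_queries, visual_tokens)
        # Dot each query embedding with per-pixel features -> (B, Q, H, W) masks
        return torch.einsum("bqc,bchw->bqhw", mask_emb, pixel_feat)
```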
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
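The core recipe can be shown as a hedged sketch: project per-frame visual features into the token-embedding space of a pretrained LLM so the LLM can "read" the video as a sequence. The projector design and dimensions below follow a common pattern and are assumptions, not VideoLLM's exact architecture.

```python
import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    """Map per-frame visual features into an LLM's token-embedding
    space (sizes are illustrative; 4096 matches a LLaMA-scale model)."""

    def __init__(self, vis_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats):      # (batch, frames, vis_dim)
        return self.proj(frame_feats)    # ready to concatenate with text embeddings

# video_tokens = projector(frame_feats)
# inputs = torch.cat([text_embeds, video_tokens], dim=1)  # fed to the frozen LLM
```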
no code implementations • 24 Apr 2023 • Yin-Dong Zheng, Guo Chen, Minglei Yuan, Tong Lu
Action detection is a challenging video understanding task that requires modeling spatio-temporal and interaction relations.
1 code implementation • 22 Jan 2023 • Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge.
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)
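A rough sketch of combining the two pretraining objectives named above; the symmetric InfoNCE form, the fixed temperature, and the learnable homoscedastic-style weighting are our assumptions about one way to coordinate them, not InternVideo's exact scheme.

```python
import torch
import torch.nn.functional as F

def internvideo_style_loss(mae_pred, mae_target, vid_emb, txt_emb, log_w):
    """Masked-reconstruction loss plus video-text contrastive (InfoNCE)
    loss, coordinated by learnable weights. log_w: learnable tensor of
    shape (2,); vid_emb/txt_emb: L2-normalized (N, d) embeddings."""
    mae_loss = F.mse_loss(mae_pred, mae_target)            # masked video modeling
    logits = vid_emb @ txt_emb.t() / 0.07                  # temperature-scaled sims
    labels = torch.arange(len(logits), device=logits.device)
    nce = (F.cross_entropy(logits, labels) +
           F.cross_entropy(logits.t(), labels)) / 2        # symmetric contrastive
    w = torch.exp(-log_w)                                  # learnable weighting
    return w[0] * mae_loss + w[1] * nce + log_w.sum()      # + regularizer on weights
```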
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 17 Nov 2022 • Yinan He, Guo Chen
In this report, we present our approach of transferring pretrained video masked autoencoders (VideoMAE) to egocentric tasks for the Ego4D Looking at Me Challenge.
no code implementations • 16 Nov 2022 • Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, LiMin Wang
Our method achieves an accuracy of 0.796 on OSCC while achieving an absolute temporal localization error of 0.516 on PNR.
2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang
Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.
Ranked #1 on Temporal Action Localization on THUMOS14
no code implementations • 19 Apr 2022 • Yueming Li, Ying Jiang, Lu Lan, Xiaowei Ge, Ran Cheng, Yuewei Zhan, Guo Chen, Linli Shi, Runyu Wang, Nan Zheng, Chen Yang, Ji-Xin Cheng
Here, we report optically-generated focused ultrasound (OFUS) for non-invasive brain stimulation with ultrahigh precision.
1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu
Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.
Ranked #19 on Temporal Action Localization on ActivityNet-1.3
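A rough sketch of the multi-path idea: parallel temporal convolutions with different receptive fields whose outputs are aggregated over the feature sequence. The kernel sizes and residual form are our simplification of MTCA, and the boundary-evaluation head is omitted.

```python
import torch
import torch.nn as nn

class MultiPathTemporalAggSketch(nn.Module):
    """Parallel temporal-context branches with increasing receptive
    fields, summed into the input features (illustrative only)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)               # short/medium/long-range context
        ])

    def forward(self, x):                    # x: (batch, dim, time)
        return x + sum(p(x) for p in self.paths)
```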
2 code implementations • 3 Nov 2021 • Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, Tong Lu
We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector).
Ranked #2 on Scene Text Detection on MSRA-TD500
no code implementations • 1 Jul 2021 • Zhiyuan Guo, Yuexin Li, Guo Chen, Xingyu Chen, Akshat Gupta
Spoken dialogue systems such as Siri and Alexa provide great convenience to people's everyday life.
Automatic Speech Recognition (ASR) +6
no code implementations • 17 Dec 2020 • Linli Shi, Ying Jiang, Fernando R. Fernandez, Lu Lan, Guo Chen, Heng-ye Man, John A. White, Ji-Xin Cheng, Chen Yang
As an emerging technology, transcranial focused ultrasound has been demonstrated to successfully evoke motor responses in mice and rabbits, and sensory/motor responses in humans.