Search Results for author: Xizhou Zhu

Found 44 papers, 38 papers with code

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

1 code implementation • 25 Apr 2024 • Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang

Compared to both open-source and proprietary models, InternVL 1. 5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.

Ranked #6 on Visual Question Answering on MM-Vet

4k Language Modelling +3

2,955

Paper
Code

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation • 4 Mar 2024 • Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs.

Image Classification

258

Paper
Code

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

1 code implementation • 29 Feb 2024 • Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.

Ranked #33 on Visual Question Answering on MM-Vet

Hallucination Object Localization +3

390

Paper
Code

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

1 code implementation • 18 Jan 2024 • Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie zhou, Hongsheng Li, Yu Qiao, Jifeng Dai

Developing generative models for interleaved image-text data has both research and practical value.

161

Paper
Code

Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization

no code implementations • 16 Jan 2024 • Chongzhi Zhang, Mingyuan Zhang, Zhiyang Teng, Jiayi Li, Xizhou Zhu, Lewei Lu, Ziwei Liu, Aixin Sun

Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process, based on the input video and language query.

Decoder Denoising +1

Paper
Add Code

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

1 code implementation • 11 Jan 2024 • Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

359

Paper
Code

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2 code implementations • 21 Dec 2023 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)

Image Retrieval Image-to-Text Retrieval +11

2,955

Paper
Code

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

no code implementations • 14 Dec 2023 • Hao Li, Xue Yang, Zhaokai Wang, Xizhou Zhu, Jie zhou, Yu Qiao, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

Many reinforcement learning environments (e. g., Minecraft) provide only sparse rewards that indicate task completion or failure with binary values.

reinforcement-learning

Paper
Add Code

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

1 code implementation • 14 Dec 2023 • Wenhai Wang, Jiangwei Xie, Chuanyang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai

In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD).

Autonomous Driving Motion Planning

127

Paper
Code

ControlLLM: Augment Language Models with Tools by Searching on Graphs

1 code implementation • 26 Oct 2023 • Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.

Scheduling

165

Paper
Code

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

1 code implementation • 11 Oct 2023 • Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models.

Code Generation Image Generation +2

292

Paper
Code

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

1 code implementation • 3 Aug 2023 • Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao

We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.

Question Answering Retrieval +1

390

Paper
Code

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

1 code implementation • 8 Jun 2023 • Changyao Tian, Chenxin Tao, Jifeng Dai, Hao Li, Ziheng Li, Lewei Lu, Xiaogang Wang, Hongsheng Li, Gao Huang, Xizhou Zhu

In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels.

Denoising Representation Learning

Paper
Code

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

1 code implementation • 25 May 2023 • Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai

These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.

Common Sense Reasoning Navigate +1

573

Paper
Code

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

2 code implementations • NeurIPS 2023 • Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie zhou, Yu Qiao, Jifeng Dai

We hope this model can set a new baseline for generalist vision and language models.

Decoder Language Modelling +1

3,146

Paper
Code

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

3,146

Paper
Code

Planning-oriented Autonomous Driving

1 code implementation • CVPR 2023 • Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning.

Autonomous Driving Philosophy

2,943

Paper
Code

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

2 code implementations • CVPR 2023 • Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.

Ranked #5 on 3D Object Detection on Rope3D

3D Object Detection

2,965

Paper
Code

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

1 code implementation • CVPR 2023 • Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie zhou, Jifeng Dai

It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.

Ranked #2 on Semantic Segmentation on ADE20K (using extra training data)

Image Classification Long-tailed Object Detection +3

Paper
Code

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

2 code implementations • CVPR 2023 • Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.

Decoder Language Modelling +1

2,353

Paper
Code

Demystify Transformers & Convolutions in Modern Image Deep Networks

1 code implementation • 10 Nov 2022 • Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai

Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs.

Image Deep Networks Spatial Token Mixer

Paper
Code

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

2 code implementations • CVPR 2023 • Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)

Classification Image Classification +3

2,353

Paper
Code

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

1 code implementation • 9 Jun 2022 • Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, Jifeng Dai

To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.

Image Captioning Image Classification +6

253

Paper
Code

Siamese Image Modeling for Self-Supervised Vision Representation Learning

2 code implementations • CVPR 2023 • Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie zhou, Yu Qiao, Xiaogang Wang, Jifeng Dai

Driven by these analysis, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations.

Representation Learning Self-Supervised Learning +1

Paper
Code

DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation

1 code implementation • 16 Mar 2022 • Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, Qiang Xu

This paper proposes a simple baseline framework for video-based 2D/3D human pose estimation that can achieve 10 times efficiency improvement over existing works without any performance degradation, named DeciWatch.

Ranked #1 on 2D Human Pose Estimation on JHMDB (2D poses only)

2D Human Pose Estimation 3D Human Pose Estimation +2

169

Paper
Code

Searching Parameterized AP Loss for Object Detection

1 code implementation • NeurIPS 2021 • Chenxin Tao, Zizhang Li, Xizhou Zhu, Gao Huang, Yong liu, Jifeng Dai

In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation.

Object object-detection +1

Paper
Code

Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework

1 code implementation • CVPR 2022 • Chenxin Tao, Honghui Wang, Xizhou Zhu, Jiahua Dong, Shiji Song, Gao Huang, Jifeng Dai

These methods appear to be quite different in the designed loss functions from various motivations.

Contrastive Learning Linear evaluation +1

Paper
Code

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

1 code implementation • CVPR 2022 • Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai

The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage.

253

Paper
Code

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

1 code implementation • 26 Nov 2021 • Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, Yu Qiao

Deep learning-based models encounter challenges when processing long-tailed data in the real world.

Ranked #4 on Long-tail Learning on iNaturalist 2018 (using extra training data)

Image Classification Long-tail Learning +1

Paper
Code

Collaborative Visual Navigation

1 code implementation • 2 Jul 2021 • Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, LiWei Wang

As a fundamental problem for Artificial Intelligence, multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques.

Multi-agent Reinforcement Learning Navigate +1

Paper
Code

AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks

no code implementations • CVPR 2022 • Hao Li, Tianwen Fu, Jifeng Dai, Hongsheng Li, Gao Huang, Xizhou Zhu

However, the automatic design of loss functions for generic tasks with various evaluation metrics remains under-investigated.

Paper
Add Code

Unsupervised Object Detection with LiDAR Clues

no code implementations • CVPR 2021 • Hao Tian, Yuntao Chen, Jifeng Dai, Zhaoxiang Zhang, Xizhou Zhu

We further identify another major issue, seldom noticed by the community, that the long-tailed and open-ended (sub-)category distribution should be accommodated.

Object object-detection +2

Paper
Add Code

Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation

1 code implementation • ICLR 2021 • Hao Li, Chenxin Tao, Xizhou Zhu, Xiaogang Wang, Gao Huang, Jifeng Dai

In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric.

Semantic Segmentation

Paper
Code

Deformable DETR: Deformable Transformers for End-to-End Object Detection

17 code implementations • ICLR 2021 • Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai

DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance.

Ranked #6 on 2D Object Detection on SARDet-100K

Real-Time Object Detection

12,211

Paper
Code

Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation

1 code implementation • ECCV 2020 • Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, Stephen Lin

In the feature maps of CNNs, there commonly exists considerable spatial redundancy that leads to much repetitive processing.

Paper
Code

Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation

2 code implementations • ICLR 2020 • Hang Gao, Xizhou Zhu, Steve Lin, Jifeng Dai

This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field.

Ranked #183 on Object Detection on COCO test-dev

Image Classification Object +1

201

Paper
Code

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

3 code implementations • ICLR 2020 • Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).

Ranked #1 on Visual Question Answering (VQA) on VCR (Q-A) dev

Image-text matching Language Modelling +6

735

Paper
Code

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

1 code implementation • ICCV 2019 • Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, Jifeng Dai

Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance.

Decoder

28,139

Paper
Code