Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

ICCV 2021 · Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Transformer model (e.g., ViT), which is designed specifically for image classification, we propose the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformers to various dense prediction tasks. PVT has several merits compared to prior art. (1) Unlike ViT, which typically has low-resolution outputs and high computational and memory costs, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNNs and Transformers, making it a unified convolution-free backbone for various vision tasks that can simply replace CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic segmentation, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT can serve as an alternative and useful backbone for pixel-level prediction and facilitate future research. Code is available at https://github.com/whai362/PVT.
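To make the "progressive shrinking pyramid" in the abstract concrete, the following is a minimal sketch of how the token count could shrink from stage to stage. The stage strides (4, 8, 16, 32), the 224x224 input, the single-scale patch size of 16, and the helper names are assumptions chosen for illustration; none of them are stated in the abstract.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return (stride, h, w, tokens) per stage of a shrinking pyramid.

    The strides are assumed values for illustration, not taken from the abstract.
    """
    shapes = []
    for stride in strides:
        h, w = height // stride, width // stride
        shapes.append((stride, h, w, h * w))
    return shapes


def single_scale_tokens(height, width, patch=16):
    """Token count for a single-scale model with one fixed patch size (assumed 16)."""
    return (height // patch) * (width // patch)


stages = pyramid_shapes(224, 224)
for stride, h, w, tokens in stages:
    print(f"stride {stride:>2}: {h}x{w} feature map, {tokens} tokens")
print("single-scale tokens:", single_scale_tokens(224, 224))
```

Under these assumptions the first stage keeps a 56x56 map (3136 tokens) for dense prediction, while the last stage is reduced to 7x7 (49 tokens), compared with a fixed 14x14 grid (196 tokens) for the single-scale case.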

Code

Repository	Stars
whai362/PVT (official)	1,640
open-mmlab/mmdetection	27,693
open-mmlab/mmpose	4,957
hustvl/sparseinst	560
martinsbruveris/tensorflow-image-mo…	279

See all 9 implementations

Tasks

Image Classification

Instance Segmentation

Object Detection

Semantic Segmentation

Datasets

ImageNet

MS COCO

ADE20K

DensePASS

Results from the Paper

Ranked #5 on Semantic Segmentation on SynPASS

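Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	box AP	43.4	# 127
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	AP50	63.6	# 47
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	AP75	46.1	# 54
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	APS	26.1	# 40
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	APM	46.0	# 43
Object Detection	COCO minival	PVT-Large (RetinaNet 3x, MS)	APL	59.5	# 32
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	box AP	42.6	# 137
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	AP50	63.7	# 46
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	AP75	45.4	# 61
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	APS	25.8	# 43
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	APM	46.0	# 43
Object Detection	COCO minival	PVT-Large (RetinaNet 1x)	APL	58.4	# 41
Semantic Segmentation	DensePASS	PVT (Tiny, FPN)	mIoU	31.20%	# 26
Semantic Segmentation	SynPASS	PVT	mIoU	32.68%	# 5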
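The two RetinaNet rows can be compared metric by metric. The sketch below copies the COCO minival values from the table and computes the difference between the 3x multi-scale (MS) schedule and the 1x schedule; the variable and function names are illustrative only.

```python
# Metrics copied from the results table (COCO minival, PVT-Large + RetinaNet).
ms_3x = {"box AP": 43.4, "AP50": 63.6, "AP75": 46.1, "APS": 26.1, "APM": 46.0, "APL": 59.5}
single_1x = {"box AP": 42.6, "AP50": 63.7, "AP75": 45.4, "APS": 25.8, "APM": 46.0, "APL": 58.4}


def schedule_deltas(longer, shorter):
    """Per-metric difference (longer minus shorter), rounded to one decimal place."""
    return {name: round(longer[name] - shorter[name], 1) for name in longer}


deltas = schedule_deltas(ms_3x, single_1x)
for name, delta in deltas.items():
    print(f"{name:>6}: {delta:+.1f}")
```

On these numbers the 3x, MS schedule adds 0.8 box AP overall, with the largest gain on large objects (APL, +1.1), no change in APM, and a slightly lower AP50 (-0.1).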

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • GELU • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • PVT • Residual Connection • Scaled Dot-Product Attention • Softmax • Spatial-Reduction Attention • Transformer
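Spatial-Reduction Attention appears in the list above without explanation. As a rough sketch, assuming it shortens the key/value sequence by a reduction ratio R along each spatial side before attention is computed, the number of query-key pairs drops by about R squared. The 56x56 feature map, the ratio of 8, and the helper name are hypothetical values picked for illustration, not figures from this page.

```python
def attention_pairs(h, w, reduction=1):
    """Count query-key pairs for one attention layer on an h x w token grid.

    Assumes keys/values are spatially reduced by `reduction` per side
    (reduction=1 corresponds to ordinary full attention).
    """
    queries = h * w
    keys = (h // reduction) * (w // reduction)
    return queries * keys


full = attention_pairs(56, 56)                  # full attention on a 56x56 grid
reduced = attention_pairs(56, 56, reduction=8)  # keys/values reduced to 7x7
print(full, reduced, full // reduced)
```

With these assumed values, the reduced variant scores 64 times fewer query-key pairs than full attention on the same grid, which is the kind of saving that makes high-resolution early stages affordable.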
