Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of \cite{dosovitskiy2020image} for encoding high-resolution images using two techniques. The first is a multi-scale model structure, which provides image encodings at multiple scales at a manageable computational cost. The second is the Vision Longformer attention mechanism, a variant of Longformer \cite{beltagy2020longformer}, which was originally developed for natural language processing and achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models, their ResNet counterparts, and the Pyramid Vision Transformer from concurrent work \cite{wang2021pyramid}, on a range of vision tasks: image classification, object detection, and segmentation. The models and source code are released at \url{https://github.com/microsoft/vision-longformer}.
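The linear-complexity claim can be illustrated by counting query-key pairs. The sketch below is not the released implementation. It assumes a Longformer-style pattern in which each patch token attends to a fixed-size 2-D window around itself plus a small number of global tokens, and each global token attends to everything. The function names, the window size of 15, and the single global token are illustrative assumptions.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every patch token attends to every patch token."""
    n = height * width
    return n * n


def local_attention_pairs(height: int, width: int, window: int,
                          num_global: int = 1) -> int:
    """Query-key pairs for windowed attention plus a few global tokens.

    Each patch token attends to the patches inside a window x window
    neighbourhood centred on it (clipped at the image border) and to the
    global tokens. Each global token attends to all tokens.
    """
    assert window % 2 == 1, "use an odd window so it can be centred"
    r = window // 2
    pairs = 0
    for y in range(height):
        rows = min(y + r, height - 1) - max(y - r, 0) + 1
        for x in range(width):
            cols = min(x + r, width - 1) - max(x - r, 0) + 1
            pairs += rows * cols + num_global
    n = height * width
    pairs += num_global * (n + num_global)  # rows of the global tokens
    return pairs


if __name__ == "__main__":
    # Doubling the side length quadruples the token count each time.
    for side in (14, 28, 56, 112):
        full = full_attention_pairs(side, side)
        local = local_attention_pairs(side, side, window=15)
        print(f"{side}x{side} tokens: full={full}, local={local}")
```

With the window size held fixed, the local count is bounded by a constant multiple of the number of tokens, whereas the full count grows with its square. That gap is what makes high-resolution feature maps affordable in the early stages of a multi-scale model.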

Published at ICCV 2021.


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Instance Segmentation | COCO minival | Mask R-CNN (ViL Base, 1x lr) | mask AP | 45.1 | # 46 |
| | | | AP50 | 67.2 | # 9 |
| | | | AP75 | 49.3 | # 6 |
| Instance Segmentation | COCO minival | Mask R-CNN (ViL Base, multi-scale, 3x lr) | mask AP | 45.7 | # 45 |
| | | | AP75 | 49.9 | # 5 |
| Object Detection | COCO minival | RetinaNet (ViL-Base, multi-scale, 3x) | box AP | 44.7 | # 114 |
| | | | AP75 | 47.6 | # 43 |
| | | | APS | 29.9 | # 19 |
| | | | APM | 48 | # 30 |
| | | | APL | 58.1 | # 44 |
| Object Detection | COCO minival | RetinaNet (ViL-Base) | box AP | 44.3 | # 121 |
| | | | AP50 | 65.5 | # 35 |
| | | | AP75 | 47.1 | # 46 |
| | | | APS | 28.9 | # 21 |
| | | | APM | 47.9 | # 31 |
| | | | APL | 58.3 | # 43 |
| Image Classification | ImageNet | ViL-Small | Top 1 Accuracy | 82% | # 530 |
| | | | Number of params | 24.6M | # 585 |
| | | | GFLOPs | 4.86 | # 230 |
| Image Classification | ImageNet | ViL-Base-D | Top 1 Accuracy | 83.2% | # 413 |
| | | | Number of params | 55.7M | # 744 |
| | | | GFLOPs | 13.4 | # 330 |
| Image Classification | ImageNet | ViL-Tiny-RPB | Top 1 Accuracy | 76.7% | # 832 |
| | | | Number of params | 6.7M | # 449 |
| | | | GFLOPs | 1.3 | # 118 |
| Image Classification | ImageNet | ViL-Base-W | Top 1 Accuracy | 81.9% | # 543 |
| | | | Number of params | 79M | # 804 |
| | | | GFLOPs | 6.74 | # 247 |
| Image Classification | ImageNet | ViL-Medium-D | Top 1 Accuracy | 83.3% | # 403 |
| | | | Number of params | 39.7M | # 675 |
| | | | GFLOPs | 8.7 | # 285 |
| Image Classification | ImageNet | ViL-Medium-W | Top 1 Accuracy | 82.9% | # 445 |
| | | | Number of params | 39.8M | # 676 |

Methods