TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	MobileViTv3-XXS	Top 1 Accuracy	70.98%	# 942
Image Classification	ImageNet	MobileViTv3-XXS	Number of params	1.2M	# 350
Image Classification	ImageNet	MobileViTv3-XXS	GFLOPs	0.289	# 28
Image Classification	ImageNet	MobileViTv3-0.5	Top 1 Accuracy	72.33%	# 925
Image Classification	ImageNet	MobileViTv3-0.5	Number of params	1.4M	# 352
Image Classification	ImageNet	MobileViTv3-0.5	GFLOPs	0.481	# 52
Image Classification	ImageNet	MobileViTv3-S	Top 1 Accuracy	79.3%	# 706
Image Classification	ImageNet	MobileViTv3-S	Number of params	5.8M	# 433
Image Classification	ImageNet	MobileViTv3-S	GFLOPs	1.841	# 142
Image Classification	ImageNet	MobileViTv3-0.75	Top 1 Accuracy	76.55%	# 842
Image Classification	ImageNet	MobileViTv3-0.75	Number of params	3M	# 366
Image Classification	ImageNet	MobileViTv3-0.75	GFLOPs	1.064	# 107
Image Classification	ImageNet	MobileViTv3-XS	Top 1 Accuracy	76.7%	# 832
Image Classification	ImageNet	MobileViTv3-XS	Number of params	2.5M	# 360
Image Classification	ImageNet	MobileViTv3-XS	GFLOPs	0.927	# 101
Image Classification	ImageNet	MobileViTv3-1.0	Top 1 Accuracy	78.64%	# 750
Image Classification	ImageNet	MobileViTv3-1.0	GFLOPs	1.876	# 143
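
The accuracy and compute columns above can be compared directly. The snippet below is a minimal sketch that copies the Top 1 Accuracy and GFLOPs values from the table, ranks the models by accuracy, and keeps the models that no other listed model beats on both accuracy and GFLOPs. The helper names `rank_by_accuracy` and `pareto_front` are hypothetical and not part of the paper or its code.

```python
# Rows copied from the table above: (model, top-1 accuracy in %, GFLOPs).
RESULTS = [
    ("MobileViTv3-XXS", 70.98, 0.289),
    ("MobileViTv3-0.5", 72.33, 0.481),
    ("MobileViTv3-S", 79.3, 1.841),
    ("MobileViTv3-0.75", 76.55, 1.064),
    ("MobileViTv3-XS", 76.7, 0.927),
    ("MobileViTv3-1.0", 78.64, 1.876),
]


def rank_by_accuracy(rows):
    """Return the rows sorted from highest to lowest top-1 accuracy."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


def pareto_front(rows):
    """Return rows not dominated by any other row, sorted by GFLOPs.

    A row is dominated when another row is at least as accurate and at
    least as cheap, and strictly better on one of the two.
    """
    front = []
    for name, acc, flops in rows:
        dominated = any(
            other_acc >= acc
            and other_flops <= flops
            and (other_acc > acc or other_flops < flops)
            for _, other_acc, other_flops in rows
        )
        if not dominated:
            front.append((name, acc, flops))
    return sorted(front, key=lambda row: row[2])


if __name__ == "__main__":
    for name, acc, flops in rank_by_accuracy(RESULTS):
        print(f"{name:18s} {acc:6.2f}%  {flops:.3f} GFLOPs")
    print("Pareto front:", [name for name, _, _ in pareto_front(RESULTS)])
```

This comparison covers only the six rows listed here; it says nothing about the other models on the ImageNet leaderboard.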

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/mobilevitv3-mobile-friendly-vision/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=mobilevitv3-mobile-friendly-vision)`

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

30 Sep 2022 · Shakti N. Wadekar, Abhishek Chaurasia

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS and S models built from our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy than MobileViTv2 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on ImageNet-1K. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and the trained models are available at: https://github.com/micronDLA/MobileViTv3
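
The abstract does not spell out how the fusion block works, so the following is only an illustrative sketch of one way to fuse local, global and input features, as the title describes. It assumes that local and global feature maps are concatenated along the channel axis, mixed with a 1x1 (pointwise) convolution, and that the input features are then added as a residual. That arrangement, the tensor shapes, and the names `conv1x1` and `fuse` are assumptions for illustration; the official repository linked above is the reference for the actual block.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise convolution without bias.

    x has shape (C_in, H, W); weight has shape (C_out, C_in).
    Each output pixel is a linear mix of the input channels at that pixel.
    """
    return np.einsum("oc,chw->ohw", weight, x)


def fuse(input_feat, local_feat, global_feat, weight):
    """Assumed fusion: concat(local, global) -> 1x1 conv -> add input.

    All three feature maps have shape (C, H, W); weight has shape (C, 2C).
    """
    stacked = np.concatenate([local_feat, global_feat], axis=0)  # (2C, H, W)
    fused = conv1x1(stacked, weight)                             # (C, H, W)
    return fused + input_feat                                    # residual add


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    channels, height, width = 4, 8, 8
    x = rng.standard_normal((channels, height, width))
    local = rng.standard_normal((channels, height, width))
    glob = rng.standard_normal((channels, height, width))
    w = rng.standard_normal((channels, 2 * channels))
    print(fuse(x, local, glob, w).shape)
```

A pointwise convolution keeps the fusion cost linear in the number of pixels, which is one reason a sketch like this stays cheap as the channel count grows.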

Code

microndla/mobilevitv3 official

191

jaiwei98/mobile-vit-pytorch

Tasks

Image Classification

Object Detection

Semantic Segmentation

Datasets

ImageNet

MS COCO

ssd

ADE20K

ImageNet-1K

PASCAL VOC

Results from the Paper

Ranked #706 on Image Classification on ImageNet

Methods

1x1 Convolution • Batch Normalization • Convolution • Dense Connections • Depthwise Convolution • Depthwise Separable Convolution • Dropout • Layer Normalization • Linear Layer • Multi-Head Attention • Pointwise Convolution • Residual Connection • Scaled Dot-Product Attention • Softmax • Vision Transformer
