TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=16$)	Top 1 Accuracy	80.1	# 1
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=16$)	GFLOPs	2.6	# 12
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=20$)	Top 1 Accuracy	79.5	# 19
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=20$)	GFLOPs	2.2	# 4
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=18$)	Top 1 Accuracy	79.9	# 3
Efficient ViTs	ImageNet-1K (with DeiT-S)	MCTF ($r=18$)	GFLOPs	2.4	# 10
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=8$)	Top 1 Accuracy	72.9	# 1
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=8$)	GFLOPs	1.0	# 18
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=20$)	Top 1 Accuracy	71.4	# 16
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=20$)	GFLOPs	0.6	# 1
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=16$)	Top 1 Accuracy	72.7	# 3
Efficient ViTs	ImageNet-1K (with DeiT-T)	MCTF ($r=16$)	GFLOPs	0.7	# 5
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=8$)	Top 1 Accuracy	83.5	# 1
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=8$)	GFLOPs	4.9	# 4
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=12$)	Top 1 Accuracy	83.4	# 2
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=12$)	GFLOPs	4.2	# 13
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=16$)	Top 1 Accuracy	82.3	# 19
Efficient ViTs	ImageNet-1K (With LV-ViT-S)	MCTF ($r=16$)	GFLOPs	3.6	# 19

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers

15 Mar 2024 · Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works reduce the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
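To make the idea of fusing tokens under several criteria concrete, the following is a minimal pure-Python sketch and not the authors' implementation (the official code lives in the repository linked above). It assumes a hypothetical attraction score that multiplies three terms (cosine similarity, inverse informativeness, inverse token size), a simple one-directional bipartite matching between alternating tokens, and size-weighted averaging when two tokens are fused. The function names `attraction` and `fuse_tokens`, the exact form of each term, and the way informativeness is averaged are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def attraction(x_i, x_j, info_i, info_j, size_i, size_j):
    """Hypothetical multi-criteria score (an assumption, not the paper's formula).

    High when two tokens are similar, both uninformative, and both small.
    """
    w_sim = (cosine(x_i, x_j) + 1.0) / 2.0    # similarity mapped to [0, 1]
    w_info = 1.0 / (info_i * info_j + 1e-12)  # penalize informative tokens
    w_size = 1.0 / (size_i * size_j)          # penalize already-large tokens
    return w_sim * w_info * w_size


def fuse_tokens(tokens, info, sizes, r):
    """Fuse up to r tokens; returns (tokens, info, sizes) after fusion."""
    n = len(tokens)
    out_tok = {k: list(tokens[k]) for k in range(n)}
    out_info = dict(enumerate(info))
    out_size = dict(enumerate(sizes))

    # Split tokens into two alternating sets: A (sources) and B (destinations).
    idx_a, idx_b = range(0, n, 2), range(1, n, 2)
    if r > 0 and len(idx_b) > 0:
        edges = []
        for i in idx_a:
            # Most attractive partner in B for source token i.
            scored = [
                (attraction(tokens[i], tokens[j], info[i], info[j],
                            sizes[i], sizes[j]), j)
                for j in idx_b
            ]
            score, j = max(scored)
            edges.append((score, i, j))

        # Fuse the r strongest edges; every source appears in one edge only.
        edges.sort(reverse=True)
        for _, i, j in edges[:r]:
            s_i, s_j = sizes[i], out_size[j]
            total = s_i + s_j
            # Size-weighted average of features (assumed pooling rule).
            out_tok[j] = [(a * s_i + b * s_j) / total
                          for a, b in zip(tokens[i], out_tok[j])]
            # Averaging informativeness is a simplification for this sketch.
            out_info[j] = (info[i] * s_i + out_info[j] * s_j) / total
            out_size[j] = total
            del out_tok[i], out_info[i], out_size[i]

    keep = sorted(out_tok)
    return ([out_tok[k] for k in keep],
            [out_info[k] for k in keep],
            [out_size[k] for k in keep])


# Toy example: tokens 0 and 1 are near-duplicates with low informativeness.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]]
info = [0.1, 0.1, 0.9, 0.9]
sizes = [1, 1, 1, 1]
fused, fused_info, fused_sizes = fuse_tokens(tokens, info, sizes, r=1)
print(len(fused), fused_sizes)  # 3 [2, 1, 1]
```

In this toy run the two near-duplicate, low-informativeness tokens are fused, while the informative tokens are left alone even though one of them is also a candidate source. Tracking `sizes` lets later fusion steps weight a fused token by how many original tokens it already represents, which is why the total size is conserved.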

Code

mlvlab/mctf official

Tasks

Computational Efficiency

Efficient ViTs

Image Classification

Datasets

ImageNet

Results from the Paper

Ranked #1 on Efficient ViTs on ImageNet-1K (With LV-ViT-S)

Methods

DeiT • Dropout • LV-ViT • Pruning • T2T-ViT • Vision Transformer
