Efficient ViTs
26 papers with code • 3 benchmarks • 0 datasets
Increasing the efficiency of ViTs without modifying the underlying architecture (e.g., key & query sparsification, token pruning & merging).
Most implemented papers
Training data-efficient image transformers & distillation through attention
In this work, we produce a competitive convolution-free transformer by training on ImageNet only.
All Tokens Matter: Token Labeling for Training Better Vision Transformers
In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
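A minimal sketch of the dense-supervision idea behind token labeling, assuming the per-token soft labels are produced offline by a pre-trained annotator as in the paper (the helper name and the loss weight `beta` are illustrative):

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, token_logits, image_label, token_soft_labels, beta=0.5):
    """cls_logits: (B, C) class-token prediction; token_logits: (B, N, C) per-patch
    predictions; image_label: (B,) ground-truth class indices; token_soft_labels:
    (B, N, C) location-specific soft labels from a pre-trained annotator (assumed given)."""
    cls_loss = F.cross_entropy(cls_logits, image_label)
    # Soft cross-entropy on every patch token, averaged over batch and tokens.
    token_loss = -(token_soft_labels * F.log_softmax(token_logits, dim=-1)).sum(-1).mean()
    return cls_loss + beta * token_loss
```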
Fast Vision Transformers with HiLo Attention
Therefore, we propose to disentangle the high/low-frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, while the other encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map.
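A heavily reduced PyTorch sketch of that head split, under stated assumptions (class name, window size, and head-split ratio `alpha` are illustrative; requires torch>=2.0 for `scaled_dot_product_attention`; the real implementation handles padding and other details):

```python
import torch
import torch.nn.functional as F
from torch import nn

class HiLoAttentionSketch(nn.Module):
    """Illustrative sketch: "Hi" heads do self-attention inside non-overlapping
    windows; "Lo" heads attend globally to window-pooled keys/values."""

    def __init__(self, dim, num_heads=8, window=2, alpha=0.5):
        super().__init__()
        self.head_dim = dim // num_heads
        self.lo_heads = int(num_heads * alpha)          # low-frequency heads
        self.hi_heads = num_heads - self.lo_heads       # high-frequency heads
        self.window = window
        self.hi_qkv = nn.Linear(dim, 3 * self.hi_heads * self.head_dim)
        self.lo_q = nn.Linear(dim, self.lo_heads * self.head_dim)
        self.lo_kv = nn.Linear(dim, 2 * self.lo_heads * self.head_dim)
        self.proj = nn.Linear((self.hi_heads + self.lo_heads) * self.head_dim, dim)

    def forward(self, x, H, W):                         # x: (B, H*W, C)
        B, N, C = x.shape
        w = self.window

        # Hi path: self-attention restricted to each w x w window.
        qkv = self.hi_qkv(x).reshape(B, H // w, w, W // w, w, 3, self.hi_heads, self.head_dim)
        qkv = qkv.permute(5, 0, 1, 3, 6, 2, 4, 7)       # (3, B, H/w, W/w, heads, w, w, d)
        qkv = qkv.reshape(3, B, (H // w) * (W // w), self.hi_heads, w * w, self.head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        hi = F.scaled_dot_product_attention(q, k, v)    # attention over w*w tokens per window
        hi = hi.reshape(B, H // w, W // w, self.hi_heads, w, w, self.head_dim)
        hi = hi.permute(0, 1, 4, 2, 5, 3, 6).reshape(B, N, -1)

        # Lo path: every query attends globally to average-pooled per-window K/V.
        pooled = x.reshape(B, H // w, w, W // w, w, C).mean(dim=(2, 4)).reshape(B, -1, C)
        q = self.lo_q(x).reshape(B, N, self.lo_heads, self.head_dim).transpose(1, 2)
        k, v = self.lo_kv(pooled).reshape(B, -1, 2, self.lo_heads, self.head_dim).unbind(dim=2)
        lo = F.scaled_dot_product_attention(q, k.transpose(1, 2), v.transpose(1, 2))
        lo = lo.transpose(1, 2).reshape(B, N, -1)

        return self.proj(torch.cat([hi, lo], dim=-1))   # (B, N, dim)
```

The Lo branch is where the savings come from: its keys and values shrink from H*W to (H/w)*(W/w) entries, so its attention cost drops by roughly w^4 while the Hi branch stays linear in the number of windows.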
Pruning Self-attentions into Convolutional Layers in Single Path
Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers.
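A minimal sketch of what a learnable binary gate over an MSA layer could look like, using a straight-through estimator to keep the hard choice differentiable (the module and its wiring are illustrative, not the paper's single-path weight-sharing implementation):

```python
import torch
from torch import nn

class GatedMSAOrConv(nn.Module):
    """Illustrative sketch: a learnable binary gate chooses, per layer, between
    keeping the MSA operation and replacing it with a cheaper convolution."""

    def __init__(self, msa: nn.Module, conv: nn.Module):
        super().__init__()
        self.msa, self.conv = msa, conv
        self.gate_logit = nn.Parameter(torch.zeros(1))   # learnable gate parameter

    def forward(self, x):
        p = torch.sigmoid(self.gate_logit)
        hard = (p > 0.5).float()
        # Straight-through: the forward pass uses the hard 0/1 decision,
        # the backward pass uses the gradient of the soft probability.
        g = hard + p - p.detach()
        return g * self.msa(x) + (1 - g) * self.conv(x)
```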
Token Merging: Your ViT But Faster
Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video, with only a 0.2-0.3% accuracy drop in each case.
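A heavily simplified sketch of one merging step, with the assumptions noted in the docstring (the actual ToMe matches on attention keys, uses size-weighted averaging, and protects the class token; the helper below is hypothetical):

```python
import torch
import torch.nn.functional as F

def merge_r_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduced sketch of bipartite-matching token merging. x: (B, N, C) tokens;
    returns (B, N - r, C). Assumes 0 < r < N // 2; uses the tokens themselves
    as similarity features and a plain mean as the merge operation."""
    a, b = x[:, ::2], x[:, 1::2]                              # alternate split into two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)   # (B, Na, Nb)
    best_sim, best_dst = sim.max(dim=-1)                      # best partner in B for each A token
    order = best_sim.argsort(dim=-1)
    keep, drop = order[:, :-r], order[:, -r:]                 # merge the r most similar A tokens

    C = x.size(-1)
    src = torch.gather(a, 1, drop.unsqueeze(-1).expand(-1, -1, C))
    dst = torch.gather(best_dst, 1, drop).unsqueeze(-1).expand(-1, -1, C)
    merged = b.clone()
    merged.scatter_reduce_(1, dst, src, reduce="mean", include_self=True)

    kept_a = torch.gather(a, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, merged], dim=1)
```

Applying such a step in every block gradually shrinks the sequence, which is why the throughput gain comes without retraining.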
Scalable Vision Transformers with Hierarchical Pooling
However, current ViT models routinely maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation.
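The remedy is to shorten the token sequence between transformer stages; a minimal sketch of such a pooling step (module name and pooling hyper-parameters are assumptions, not the paper's code):

```python
import torch
from torch import nn

class TokenPool1d(nn.Module):
    """Illustrative sketch: shorten the patch-token sequence between stages with
    1-D max pooling, so later blocks attend over fewer tokens."""

    def __init__(self, kernel_size=3, stride=2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size, stride=stride, padding=kernel_size // 2)

    def forward(self, tokens):            # tokens: (B, N, C) patch tokens (no class token)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)   # (B, ~N/2, C)
```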
PPT: Token Pruning and Pooling for Efficient Vision Transformers
Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks.
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input.
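A minimal sketch of one such input-dependent pruning step (module name and scoring MLP are assumptions; the paper applies this at several layers and uses a differentiable relaxation with attention masking during training, whereas the sketch shows the inference-time behavior):

```python
import torch
from torch import nn

class TokenScorer(nn.Module):
    """Illustrative sketch: a small MLP scores each patch token and only the
    top-scoring fraction is kept for the subsequent blocks."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens, keep_ratio=0.7):
        # tokens: (B, N, C) patch tokens; scores decide which tokens survive.
        scores = self.mlp(tokens).squeeze(-1)                    # (B, N)
        k = max(1, int(tokens.size(1) * keep_ratio))
        idx = scores.topk(k, dim=1).indices                      # indices of kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)                      # (B, k, C)
```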
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while enjoying 49.32% FLOPs and 4.40% running-time savings.
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Vision transformers (ViTs) have recently gained explosive popularity, but their huge computational cost remains a severe issue.