Sparse Transformer

Introduced by Child et al. in Generating Long Sequences with Sparse Transformers

A Sparse Transformer is a Transformer-based architecture that utilises sparse factorizations of the attention matrix to reduce the time and memory cost of self-attention from $O(n^2)$ to $O(n \sqrt{n})$. Other changes to the Transformer architecture include: (a) a restructured residual block and weight initialization, (b) a set of sparse attention kernels that efficiently compute subsets of the attention matrix, and (c) recomputation of attention weights during the backward pass to reduce memory usage.
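As a concrete illustration, the sketch below builds the paper's "strided" sparsity pattern as a boolean mask and applies it to dense softmax attention in NumPy. The function names, toy shapes, and dense masking are illustrative assumptions for readability; the actual Sparse Transformer implements this pattern with fused GPU kernels that never materialize the full $n \times n$ matrix.

```python
# Minimal sketch of the "strided" factorized attention pattern from
# Child et al. (2019). Illustrative only: the released implementation
# uses fused GPU kernels rather than a dense mask.
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean (n, n) mask: position i attends to the previous `stride`
    positions (local part) and to every stride-th earlier position
    (strided part). With stride ~ sqrt(n), each row has O(sqrt(n))
    nonzeros, so total attention cost drops from O(n^2) to O(n*sqrt(n))."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride             # recent positions
    summary = (i - j) % stride == 0      # strided "summary" positions
    return causal & (local | summary)

def sparse_attention(q, k, v, stride):
    """Masked softmax attention; disallowed pairs get -inf logits."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(strided_sparse_mask(n, stride), logits, -np.inf)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: sequence length 64, head dimension 16, stride 8 (= sqrt(64)).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
out = sparse_attention(q, k, v, stride=8)
print(out.shape)  # (64, 16)
```

With stride $l = \sqrt{n}$, each query attends to at most $2l$ keys, which is where the $O(n \sqrt{n})$ overall cost comes from.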

Tasks


Task                   Papers  Share
Language Modelling          6  7.89%
Text Classification         4  5.26%
Object Detection            3  3.95%
Decoder                     3  3.95%
Question Answering          3  3.95%
Machine Translation         3  3.95%
Translation                 3  3.95%
Image Restoration           2  2.63%
Semantic Segmentation       2  2.63%

Categories

Transformers