SortCut Sinkhorn Attention is a variant of Sparse Sinkhorn Attention that performs a post-sorting truncation of the input sequence, essentially a hard top-k operation on the sorted input sequence blocks within the computational graph. While most attention models merely re-weight tokens or assign them near-zero weights during training, this allows the input sequence to be explicitly and dynamically truncated. Specifically:
$$ Y = \text{Softmax}\left(Q\,\psi_{S}\left(K\right)^{T}_{\left[:n\right]}\right)\psi_{S}\left(V\right)_{\left[:n\right]} $$
where $\psi_{S}$ is the sorting network and $n$ is the SortCut budget hyperparameter.
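A minimal sketch of the truncation step may help make the equation concrete. The code below assumes the sorting network $\psi_{S}$ is already available as a (soft) block-permutation matrix, and that the budget $n$ counts blocks; the function and parameter names (`sortcut_attention`, `perm`, `budget`) are illustrative, not taken from the paper's released code.

```python
import torch

def sortcut_attention(q, k, v, perm, budget):
    """q: (batch, seq, dim); k, v: (batch, blocks, block_len, dim);
    perm: (batch, blocks, blocks) block permutation produced by the
    sorting network psi_S; budget: n, the number of blocks kept."""
    b, nb, bl, d = k.shape
    # Apply psi_S: re-order the key/value blocks with the permutation.
    k_sorted = torch.einsum('bij,bjld->bild', perm, k)
    v_sorted = torch.einsum('bij,bjld->bild', perm, v)
    # SortCut: hard truncation, keeping only the first `budget` blocks
    # of the sorted sequence (the [:n] slice in the equation above).
    k_cut = k_sorted[:, :budget].reshape(b, budget * bl, d)
    v_cut = v_sorted[:, :budget].reshape(b, budget * bl, d)
    # Softmax attention over the truncated keys/values; the 1/sqrt(d)
    # scaling is the usual dot-product convention, added here as an
    # assumption since the equation omits it.
    scores = torch.einsum('bqd,bkd->bqk', q, k_cut) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cut

# Example: 16 blocks of length 8; attend only to the top 4 sorted blocks.
q = torch.randn(2, 128, 64)
k = torch.randn(2, 16, 8, 64)
v = torch.randn(2, 16, 8, 64)
perm = torch.eye(16).expand(2, -1, -1)  # identity stands in for psi_S
out = sortcut_attention(q, k, v, perm, budget=4)
print(out.shape)  # torch.Size([2, 128, 64])
```

Because the slice keeps only `budget * block_len` key/value positions, the attention matrix shrinks from seq × seq to seq × (n · block_len), which is where the efficiency gain over dense attention comes from.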
Source: Sparse Sinkhorn Attention
Task | Papers | Share
---|---|---
Document Classification | 1 | 25.00%
Image Generation | 1 | 25.00%
Language Modelling | 1 | 25.00%
Natural Language Inference | 1 | 25.00%
Component | Type
---|---
Feedforward Network | Feedforward Networks
ReLU | Activation Functions
Softmax | Output Functions