gMLP is an MLP-based alternative to Transformers without self-attention, which simply consists of channel projections and spatial projections with static parameterization. It is built out of basic MLP layers with gating. The model consists of a stack of $L$ blocks with identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:
$$ Z=\sigma(X U), \quad \tilde{Z}=s(Z), \quad Y=\tilde{Z} V $$
where $\sigma$ is an activation function such as GELU. $U$ and $V$ define linear projections along the channel dimension, the same as those in the FFNs of Transformers (e.g., their shapes are $768 \times 3072$ and $3072 \times 768$ for $\text{BERT}_{\text{base}}$).
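The three equations map directly onto array operations. The following is a minimal NumPy sketch, not the authors' implementation: the names `gelu` and `gmlp_block`, the tanh approximation of GELU, the random weights, and the small sizes are all assumptions chosen for illustration, and the normalization and residual connection around the block are omitted.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact erf form also works)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_block(X, U, V, s=lambda Z: Z):
    """Compute Z = gelu(X U), Z~ = s(Z), Y = Z~ V.

    X: (n, d) token representations, U: (d, d_ffn), V: (d_ffn, d).
    `s` is a placeholder for the spatial layer; the default is the identity.
    """
    Z = gelu(X @ U)       # channel projection plus activation
    Z_tilde = s(Z)        # spatial interaction (identity here)
    return Z_tilde @ V    # channel projection back to d

# Toy sizes (hypothetical); BERT-base would use d = 768, d_ffn = 3072.
rng = np.random.default_rng(0)
n, d, d_ffn = 4, 8, 32
X = rng.normal(size=(n, d))
U = rng.normal(size=(d, d_ffn)) * 0.1
V = rng.normal(size=(d_ffn, d)) * 0.1

Y = gmlp_block(X, U, V)
print(Y.shape)  # (4, 8)
```

Passing a different callable as `s` is the only change needed to turn this per-token computation into one that mixes information across tokens.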
A key ingredient is $s(\cdot)$, a layer which captures spatial interactions. When $s$ is an identity mapping, the above transformation degenerates to a regular FFN, where individual tokens are processed independently without any cross-token communication. A major focus is therefore designing a good $s$ capable of capturing complex spatial interactions across tokens. This leads to the use of a Spatial Gating Unit, which involves a modified linear gating.
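As described in the source paper, the Spatial Gating Unit splits $Z$ into two halves $Z_1, Z_2$ along the channel dimension and computes $s(Z) = Z_1 \odot (W Z_2 + b)$, where $W \in \mathbb{R}^{n \times n}$ acts along the token dimension. The sketch below illustrates that idea in NumPy under simplifying assumptions: the function name `spatial_gating_unit` and the toy sizes are hypothetical, and the normalization the paper applies to $Z_2$ before the spatial projection is left out.

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """s(Z) = Z1 * (W @ Z2 + b), a sketch of the gating described above.

    Z: (n, d_ffn) with even d_ffn, W: (n, n) token-mixing weights,
    b: (n, 1) per-token bias. Returns an array of shape (n, d_ffn // 2).
    """
    Z1, Z2 = np.split(Z, 2, axis=-1)   # split along the channel dimension
    gate = W @ Z2 + b                  # linear projection across tokens
    return Z1 * gate                   # element-wise gating

rng = np.random.default_rng(0)
n, d_ffn = 4, 6
Z = rng.normal(size=(n, d_ffn))

# With W = 0 and b = 1 the gate is all ones, so the unit passes Z1 through.
W_zero = np.zeros((n, n))
b = np.ones((n, 1))
out = spatial_gating_unit(Z, W_zero, b)
print(np.allclose(out, Z[:, : d_ffn // 2]))  # True

# With a nonzero W, perturbing the last token changes the output at token 0.
W_mix = rng.normal(size=(n, n)) * 0.5
Z_perturbed = Z.copy()
Z_perturbed[3, d_ffn // 2 :] += 1.0
delta = spatial_gating_unit(Z_perturbed, W_mix, b) - spatial_gating_unit(Z, W_mix, b)
print(np.abs(delta[0]).max() > 0)  # True
```

Because the output keeps only half of the channels, the projection $V$ that follows would have $d_{\text{ffn}}/2$ rows in this variant.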
The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP does not require position embeddings, because positional information is captured in $s(\cdot)$.
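One way to see why $s(\cdot)$ can carry positional information is to check how it reacts to reordering the tokens. The sketch below is an illustration under assumptions, not code from the paper: `identity_s` and `gated_s` are hypothetical helpers, the gated form $Z_1 \odot (W Z_2 + b)$ follows the paper's Spatial Gating Unit without its normalization, and `W` is filled with random values in place of learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
Z = rng.normal(size=(n, 2 * d))
W = rng.normal(size=(n, n))        # stand-in for learned spatial weights
b = np.ones((n, 1))
perm = np.array([1, 2, 3, 4, 0])   # cyclic shift of the token order

def identity_s(Z):
    # No cross-token interaction: each row is returned unchanged.
    return Z

def gated_s(Z):
    # W[i, j] is tied to positions i and j, not to token content.
    Z1, Z2 = np.split(Z, 2, axis=-1)
    return Z1 * (W @ Z2 + b)

# Identity: permuting the input only permutes the output, so order is invisible.
identity_equivariant = np.allclose(identity_s(Z[perm]), identity_s(Z)[perm])

# Gated: the same permutation produces genuinely different outputs.
gated_equivariant = np.allclose(gated_s(Z[perm]), gated_s(Z)[perm])

print(identity_equivariant, gated_equivariant)  # True False
```

Any purely per-token function behaves like the identity case here; it is the position-indexed matrix `W` that makes the output depend on token order.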
Source: Pay Attention to MLPs
Task | Papers | Share |
---|---|---|
Image Classification | 3 | 13.04% |
Instance Segmentation | 2 | 8.70% |
Object Detection | 2 | 8.70% |
Semantic Segmentation | 2 | 8.70% |
Question Answering | 2 | 8.70% |
Graph Representation Learning | 1 | 4.35% |
Node Classification | 1 | 4.35% |
Classification | 1 | 4.35% |
Multi-Label Classification | 1 | 4.35% |
Component | Type |
---|---|
GELU | Activation Functions |
Layer Normalization | Normalization |
Residual Connection | Skip Connections |
Spatial Gating Unit | Feedforward Networks |