TAM is designed to capture complex temporal relationships both efficiently and flexibly. Instead of self-attention, it adopts an adaptive kernel to capture global contextual information, with lower time complexity than GLTR.
TAM has two branches: a local branch and a global branch. Given the input feature map $X\in \mathbb{R}^{C\times T\times H\times W}$, global spatial average pooling $\text{GAP}$ is first applied to the feature map to keep TAM's computational cost low. The local branch then employs 1D convolutions with ReLU nonlinearity across the temporal domain to produce a location-sensitive importance map that enhances frame-wise features. The local branch can be written as \begin{align} s &= \sigma(\text{Conv1D}(\delta(\text{Conv1D}(\text{GAP}(X))))) \end{align} \begin{align} X^1 &= s \odot X \end{align} where $\sigma$ is the sigmoid function, $\delta$ is ReLU, and $\odot$ denotes element-wise multiplication (broadcast over the spatial dimensions). Unlike the local branch, the global branch is location-invariant and focuses on generating a channel-wise adaptive kernel from the global temporal information in each channel. For the $c$-th channel, the kernel can be written as
\begin{align} \Theta_c = \text{Softmax}(\text{FC}_2(\delta(\text{FC}_1(\text{GAP}(X)_c)))) \end{align}
where $\Theta_c \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^1$ along the temporal dimension: \begin{align} Y = \Theta \otimes X^1 \end{align}
With the help of the local branch and global branch, TAM can capture the complex temporal structures in video and enhance per-frame features at low computational cost. Due to its flexibility and lightweight design, TAM can be added to any existing 2D CNNs.
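The two branches above can be sketched in PyTorch. This is a minimal illustration, not the authors' reference implementation: the reduction ratio, hidden sizes, and kernel size are illustrative assumptions, and the channel-wise adaptive convolution is realized with a grouped `F.conv1d` whose weights are the predicted kernels $\Theta$.

```python
# Hypothetical sketch of TAM; layer widths and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAM(nn.Module):
    def __init__(self, channels, n_segments, kernel_size=3, reduction=4):
        super().__init__()
        self.K = kernel_size
        # Local branch: two temporal Conv1Ds -> sigmoid importance map s (Eq. for s).
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Global branch: two FC layers over the T-length temporal signal of each
        # channel -> softmax-normalized adaptive kernel Theta_c in R^K.
        self.global_fc = nn.Sequential(
            nn.Linear(n_segments, n_segments * 2),
            nn.ReLU(inplace=True),
            nn.Linear(n_segments * 2, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        g = x.mean(dim=(3, 4))                          # GAP over space: (N, C, T)
        s = self.local(g).unsqueeze(-1).unsqueeze(-1)   # (N, C, T, 1, 1)
        x1 = s * x                                      # X^1 = s ⊙ X
        theta = self.global_fc(g)                       # (N, C, K) adaptive kernels
        # Y = Theta ⊗ X^1: convolve each channel with its own kernel along T.
        # Fold (H*W) into the batch dim and use a grouped conv with N*C groups.
        x1 = x1.view(n, c, t, h * w).permute(3, 0, 1, 2).reshape(h * w, n * c, t)
        y = F.conv1d(x1, theta.reshape(n * c, 1, self.K),
                     groups=n * c, padding=self.K // 2)
        return y.reshape(h * w, n, c, t).permute(1, 2, 3, 0).reshape(n, c, t, h, w)

# Usage: TAM preserves the input shape, so it can be dropped into a 2D CNN block.
tam = TAM(channels=8, n_segments=4)
out = tam(torch.randn(2, 8, 4, 6, 6))
print(out.shape)  # torch.Size([2, 8, 4, 6, 6])
```

Because the output has the same shape as the input, the module can be inserted after any convolutional block of an existing 2D backbone, as the text notes.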
Source: TAM: Temporal Adaptive Module for Video Recognition
| Task | Papers | Share |
|---|---|---|
| Action Recognition | 3 | 7.14% |
| Semantic Segmentation | 2 | 4.76% |
| Multivariate Time Series Forecasting | 1 | 2.38% |
| Time Series Forecasting | 1 | 2.38% |
| Classification | 1 | 2.38% |
| Combinatorial Optimization | 1 | 2.38% |
| Anomaly Detection | 1 | 2.38% |
| Graph Anomaly Detection | 1 | 2.38% |
| Human-Object Interaction Detection | 1 | 2.38% |