no code implementations • 11 Mar 2024 • Erkut Akdag, Zeqi Zhu, Egor Bondarev, Peter H. N. de With
The model uses 2D-pose features as the positional embedding of the transformer, and spatio-temporal features as the main input to the transformer encoder.
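This token-plus-pose-embedding scheme can be sketched minimally as follows (not the authors' code; the token layout, joint count, and linear pose projection are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, D = 16, 17, 64  # frames, joints (e.g. a 17-keypoint skeleton), model width

# Spatio-temporal features: one D-dim token per (frame, joint) pair,
# serving as the main input to the transformer encoder.
tokens = rng.normal(size=(T * J, D))

# 2D-pose features: (x, y) coordinates per joint per frame, projected to
# D dims (hypothetical learned linear layer) and added in place of the
# usual sinusoidal positional embedding.
pose_xy = rng.uniform(size=(T * J, 2))
W_pose = rng.normal(size=(2, D)) * 0.02  # hypothetical projection weights
pos_embed = pose_xy @ W_pose

encoder_input = tokens + pos_embed  # what the encoder would consume
print(encoder_input.shape)  # (272, 64)
```

The key design choice is that positional information comes from where the person's joints actually are in the image, rather than from a fixed index-based encoding.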