1 code implementation • 27 Apr 2022 • Alex Falcon, Swathikiran Sudhakaran, Giuseppe Serra, Sergio Escalera, Oswald Lanz
We show that even if the fixed margin were carefully tuned, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance.
Ranked #7 on Multi-Instance Retrieval on EPIC-KITCHENS-100
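For context on the entry above, here is a minimal sketch of the fixed-margin triplet ranking loss commonly used in cross-modal retrieval, alongside an illustrative variant where the margin is derived from a per-triplet relevance score instead of being a tuned hyper-parameter. The relevance-based variant is an assumption meant only to convey the idea of removing the margin hyper-parameter, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fixed_margin_triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet ranking loss with a hand-tuned fixed margin."""
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()

def relevance_margin_triplet_loss(anchor, positive, negative, relevance):
    """Illustrative variant (assumption): the margin is not a hyper-parameter
    but comes from a per-triplet relevance score in [0, 1]."""
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    return torch.clamp(relevance - pos_sim + neg_sim, min=0).mean()

# Toy usage with random embeddings.
a, p, n = (torch.randn(8, 256) for _ in range(3))
rel = torch.rand(8)  # hypothetical relevance of each positive pair
print(fixed_margin_triplet_loss(a, p, n).item(),
      relevance_margin_triplet_loss(a, p, n, rel).item())
```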
1 code implementation • 16 Mar 2022 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs.
Ranked #17 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
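As a reference for what kernel factorization means in the entry above, a minimal sketch assuming a generic (2+1)D-style decomposition in PyTorch: one k×k×k 3D convolution is replaced by a spatial 1×k×k convolution followed by a temporal k×1×1 convolution. This illustrates the general complexity reduction, not the specific module proposed in the paper.

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Replace a full k x k x k 3D convolution with a spatial (1 x k x k)
    convolution followed by a temporal (k x 1 x 1) convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.spatial(x))

full = nn.Conv3d(64, 64, 3, padding=1)
factored = FactorizedConv3d(64, 64)
x = torch.randn(1, 64, 8, 56, 56)
print(full(x).shape, factored(x).shape)  # same output shape
print(sum(p.numel() for p in full.parameters()),
      sum(p.numel() for p in factored.parameters()))  # factorized version has fewer parameters
```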
no code implementations • 6 Oct 2021 • Swathikiran Sudhakaran, Adrian Bulat, Juan-Manuel Perez-Rua, Alex Falcon, Sergio Escalera, Oswald Lanz, Brais Martinez, Georgios Tzimiropoulos
This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021.
1 code implementation • NeurIPS 2021 • Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, Georgios Tzimiropoulos
In this work, we propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence incurs no overhead compared to an image-based Transformer model.
Ranked #32 on Action Classification on Kinetics-600
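To illustrate the scaling argument only: one simple way to make attention cost grow linearly with the number of frames is to restrict self-attention to tokens within each frame and aggregate across time with a cheap pooling step. The sketch below is an assumption for illustration and is not the space-time mixing mechanism proposed in the paper.

```python
import torch
import torch.nn as nn

class PerFrameAttention(nn.Module):
    """Self-attention restricted to the tokens of a single frame. Total cost is
    (frames) x (tokens_per_frame)^2, i.e. linear in the number of frames,
    unlike joint space-time attention, which is quadratic in it."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):            # x: (batch, frames, tokens, dim)
        b, t, n, d = x.shape
        x = x.reshape(b * t, n, d)   # attend within each frame only
        out, _ = self.attn(x, x, x)
        return out.reshape(b, t, n, d)

tokens = torch.randn(2, 8, 196, 192)                      # 8 frames of 14x14 patch tokens
frame_feats = PerFrameAttention()(tokens).mean(dim=2)     # (batch, frames, dim)
clip_feat = frame_feats.mean(dim=1)                       # cheap temporal aggregation
print(clip_feat.shape)                                    # torch.Size([2, 192])
```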
no code implementations • 16 Feb 2021 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets.
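A minimal sketch of the verb-noun label decomposition the entry refers to, assuming a shared clip descriptor feeding separate verb and noun classification heads; vocabulary sizes are hypothetical. EgoACO's actual action-context-object pooling is more involved, so this only illustrates how the verb-noun structure of egocentric labels can be exploited.

```python
import torch
import torch.nn as nn

class VerbNounHeads(nn.Module):
    """Predict verb and noun separately from a shared clip descriptor;
    an action such as 'take plate' is the (verb, noun) pair."""
    def __init__(self, feat_dim=512, n_verbs=97, n_nouns=300):  # hypothetical sizes
        super().__init__()
        self.verb_head = nn.Linear(feat_dim, n_verbs)
        self.noun_head = nn.Linear(feat_dim, n_nouns)

    def forward(self, clip_feat):    # clip_feat: (batch, feat_dim)
        return self.verb_head(clip_feat), self.noun_head(clip_feat)

heads = VerbNounHeads()
verb_logits, noun_logits = heads(torch.randn(4, 512))
loss = nn.functional.cross_entropy(verb_logits, torch.randint(0, 97, (4,))) \
     + nn.functional.cross_entropy(noun_logits, torch.randint(0, 300, (4,)))
print(verb_logits.shape, noun_logits.shape, loss.item())
```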
no code implementations • 24 Jun 2020 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
In this report we describe the technical details of our submission to the EPIC-Kitchens Action Recognition 2020 Challenge.
2 code implementations • CVPR 2020 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
Deep 3D CNNs for video action recognition are designed to learn powerful representations in the joint spatio-temporal feature space.
Ranked #26 on Action Recognition on Something-Something V1 (using extra training data)
no code implementations • 2 Jul 2019 • Swathikiran Sudhakaran, Oswald Lanz
We review three recent deep-learning-based methods for action recognition and present a brief comparative analysis of the methods from a neurophysiological point of view.
no code implementations • 21 Jun 2019 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
In this report we describe the technical details of our submission to the EPIC-Kitchens 2019 action recognition challenge.
no code implementations • 29 May 2019 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
Most action recognition methods are based on either a) late aggregation of frame-level CNN features using average pooling, max pooling, or an RNN, among others, or b) spatio-temporal aggregation via 3D convolutions.
Ranked #51 on Action Recognition on HMDB-51 (using extra training data)
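A minimal sketch of option a) from the entry above: a 2D CNN encodes each frame independently, and the frame-level features are aggregated late by average pooling before classification. The backbone choice, feature dimension, and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AvgPoolAggregation(nn.Module):
    """Late aggregation: a 2D CNN encodes each frame independently and the
    per-frame features are averaged before the classifier."""
    def __init__(self, num_classes=174):  # e.g. Something-Something classes (assumption)
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clip):             # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).flatten(1)  # (b*t, 512)
        clip_feat = feats.view(b, t, -1).mean(dim=1)         # average pooling over frames
        return self.classifier(clip_feat)

logits = AvgPoolAggregation()(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)                      # torch.Size([2, 174])
```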
1 code implementation • CVPR 2019 • Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz
Egocentric activity recognition is one of the most challenging tasks in video analysis.
Ranked #5 on Egocentric Activity Recognition on EGTEA
no code implementations • 29 Aug 2018 • Swathikiran Sudhakaran, Oswald Lanz
Most recent approaches for action recognition from video leverage deep architectures to encode the video clip into a fixed-length representation vector that is then used for classification.
1 code implementation • 31 Jul 2018 • Swathikiran Sudhakaran, Oswald Lanz
Our model is built on the observation that egocentric activities are highly characterized by the objects and their locations in the video.
Ranked #6 on Egocentric Activity Recognition on EGTEA
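A minimal sketch of the intuition in the entry above, assuming a simple soft spatial attention over frame feature maps so that features at likely object locations are weighted more heavily; this is only an illustration of attending to object locations, not the paper's recurrent attention mechanism.

```python
import torch
import torch.nn as nn

class SpatialAttentionPool(nn.Module):
    """Weight each spatial location of a frame feature map by a learned
    attention score, so regions containing the manipulated object dominate
    the pooled descriptor."""
    def __init__(self, channels=512):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):             # fmap: (batch, C, H, W)
        attn = torch.softmax(self.score(fmap).flatten(2), dim=-1)  # (B, 1, H*W)
        feats = fmap.flatten(2)                                    # (B, C, H*W)
        return (feats * attn).sum(dim=-1)                          # (B, C)

pooled = SpatialAttentionPool()(torch.randn(4, 512, 7, 7))
print(pooled.shape)                      # torch.Size([4, 512])
```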
no code implementations • 19 Sep 2017 • Swathikiran Sudhakaran, Oswald Lanz
A convolutional neural network is used to extract frame-level features from a video.
no code implementations • 19 Sep 2017 • Swathikiran Sudhakaran, Oswald Lanz
The proposed approach uses a pair of convolutional neural networks, whose parameters are shared, for extracting frame-level features from successive frames of the video.
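A minimal sketch of the shared-parameter (Siamese) setup described above: the same CNN, applied to two successive frames, with the feature difference used as a simple change-sensitive descriptor. The difference operation and backbone are assumptions for illustration rather than the paper's exact aggregation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SiameseFramePair(nn.Module):
    """One CNN with shared weights encodes both frames of a successive pair."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer

    def forward(self, frame_t, frame_t1):          # each: (batch, 3, H, W)
        f_t = self.encoder(frame_t).flatten(1)     # same weights used for both frames
        f_t1 = self.encoder(frame_t1).flatten(1)
        return f_t1 - f_t                          # change between frames (illustrative choice)

diff = SiameseFramePair()(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(diff.shape)                                  # torch.Size([2, 512])
```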