NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation
Until recently, the Video Instance Segmentation (VIS) community operated under the common belief that offline methods are generally superior to frame-by-frame online processing. However, the recent success of online methods questions this belief, in particular for challenging and long video sequences. We understand this work as a rebuttal of those recent observations and an appeal to the community to focus on dedicated near-online VIS approaches. To support our argument, we present a detailed analysis of different processing paradigms and the new end-to-end trainable NOVIS (Near-Online Video Instance Segmentation) method. Our transformer-based model directly predicts spatio-temporal mask volumes for clips of frames and performs instance tracking between clips via overlap embeddings. NOVIS represents the first near-online VIS approach which avoids any handcrafted tracking heuristics. We outperform all existing VIS methods by large margins and provide new state-of-the-art results on both the YouTube-VIS (2019/2021) and OVIS benchmarks.
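The three processing paradigms contrasted in the abstract differ mainly in how many frames are processed at once. The following is a minimal sketch of that distinction only, not the NOVIS implementation: the function name `clip_windows` and the parameters `clip_len` and `stride` are hypothetical, and the clip length and overlap actually used by the authors are not stated on this page.

```python
from typing import List


def clip_windows(num_frames: int, clip_len: int, stride: int) -> List[List[int]]:
    """Split a video of `num_frames` frames into clips of frame indices.

    Illustrative assumption, not taken from the paper:
      clip_len == 1           -> online (frame-by-frame) processing
      1 < clip_len < frames   -> near-online (clip-wise) processing
      clip_len >= num_frames  -> offline (whole video at once)
    Consecutive clips share `clip_len - stride` frames.
    """
    if clip_len < 1 or stride < 1:
        raise ValueError("clip_len and stride must be positive")
    if stride > clip_len:
        raise ValueError("stride larger than clip_len would skip frames")
    if num_frames <= 0:
        return []

    clips: List[List[int]] = []
    start = 0
    while True:
        # The last clip may be shorter than clip_len.
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    print(clip_windows(7, clip_len=3, stride=2))  # near-online, overlapping clips
    print(clip_windows(4, clip_len=1, stride=1))  # online
    print(clip_windows(4, clip_len=4, stride=4))  # offline
```

With a stride smaller than the clip length, neighbouring clips share frames. Such shared frames are one place where instance identities could be linked between clips; how NOVIS does this with its overlap embeddings is not shown here.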
Results from the Paper
Ranked #5 on Video Instance Segmentation on YouTube-VIS validation (using extra training data)
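Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Video Instance Segmentation | OVIS validation | NOVIS (ResNet-50) | mask AP | 32.7 | #26 |
Video Instance Segmentation | OVIS validation | NOVIS (ResNet-50) | AP50 | 56.2 | #22 |
Video Instance Segmentation | OVIS validation | NOVIS (ResNet-50) | AP75 | 32.6 | #25 |
Video Instance Segmentation | OVIS validation | NOVIS (ResNet-50) | AR1 | 15.7 | #20 |
Video Instance Segmentation | OVIS validation | NOVIS (ResNet-50) | AR10 | 37.1 | #21 |
Video Instance Segmentation | OVIS validation | NOVIS (Swin-L) | mask AP | 43.5 | #11 |
Video Instance Segmentation | OVIS validation | NOVIS (Swin-L) | AP50 | 68.3 | #12 |
Video Instance Segmentation | OVIS validation | NOVIS (Swin-L) | AP75 | 43.8 | #14 |
Video Instance Segmentation | OVIS validation | NOVIS (Swin-L) | AR1 | 19.4 | #3 |
Video Instance Segmentation | OVIS validation | NOVIS (Swin-L) | AR10 | 46.9 | #12 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (ResNet-50) | mask AP | 47.2 | #21 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (ResNet-50) | AP50 | 69.4 | #20 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (ResNet-50) | AP75 | 50.0 | #21 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (ResNet-50) | AR10 | 54.4 | #21 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (ResNet-50) | AR1 | 41.3 | #20 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (Swin-L) | mask AP | 59.8 | #8 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (Swin-L) | AP50 | 82.0 | #5 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (Swin-L) | AP75 | 66.5 | #7 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (Swin-L) | AR10 | 64.4 | #8 |
Video Instance Segmentation | YouTube-VIS 2021 | NOVIS (Swin-L) | AR1 | 47.9 | #6 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (ResNet-50) | mask AP | 52.8 | #22 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (ResNet-50) | AP50 | 75.7 | #20 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (ResNet-50) | AP75 | 56.9 | #20 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (ResNet-50) | AR1 | 50.3 | #16 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (ResNet-50) | AR10 | 60.6 | #16 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (Swin-L) | mask AP | 65.7 | #5 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (Swin-L) | AP50 | 87.8 | #5 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (Swin-L) | AP75 | 72.2 | #5 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (Swin-L) | AR1 | 56.3 | #4 |
Video Instance Segmentation | YouTube-VIS validation | NOVIS (Swin-L) | AR10 | 70.3 | #3 |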