Non-maximum Suppression Also Closes the Variational Approximation Gap of Multi-object Variational Autoencoders
Learning object-centric scene representations is crucial for structural scene understanding. However, current unsupervised scene factorization and representation learning models do not reason about relations among scene objects during inference. In this paper, we address this issue by introducing a differentiable correlation prior that forces the inference models to suppress duplicate object representations. We evaluate the extension by adding it to three different scene understanding approaches. The results show that models trained with the proposed method not only outperform the original models in scene factorization and produce fewer duplicate representations, but also close the approximation gap between the data evidence and the evidence lower bound.
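The abstract does not spell out the form of the correlation prior, but its stated effect (suppressing duplicate object representations) can be illustrated with a minimal sketch: penalize pairwise cosine similarity between per-object latent vectors ("slots"), so that two slots encoding the same object incur a higher loss than slots encoding distinct objects. The function below is a hypothetical illustration, not the paper's actual prior.

```python
import numpy as np

def duplicate_suppression_penalty(slots, eps=1e-8):
    """Hypothetical sketch of a correlation prior: penalize pairwise
    cosine similarity between per-object latent vectors (K x D), so
    duplicate slots incur a higher loss than distinct ones."""
    # Normalize each slot vector to unit length.
    unit = slots / (np.linalg.norm(slots, axis=1, keepdims=True) + eps)
    # Pairwise cosine similarities; the diagonal is self-similarity.
    sim = unit @ unit.T
    k = slots.shape[0]
    # Sum positive off-diagonal similarities (clip at 0 so
    # anti-correlated slots are not rewarded), average over pairs.
    off_diag = sim - np.eye(k)
    return np.clip(off_diag, 0.0, None).sum() / (k * (k - 1))

# Two identical slots are penalized more than two orthogonal ones.
dup = np.array([[1.0, 0.0], [1.0, 0.0]])
distinct = np.array([[1.0, 0.0], [0.0, 1.0]])
print(duplicate_suppression_penalty(dup))       # 1.0
print(duplicate_suppression_penalty(distinct))  # 0.0
```

In a VAE training loop, such a term would be added to the evidence lower bound as a differentiable regularizer, since cosine similarity is differentiable with respect to the slot vectors.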