Probabilistic Multimodal Representation Learning

1 Jan 2021 · Leila Pishdad, Ran Zhang, Afsaneh Fazly, Allan Jepson

Learning multimodal representations is a requirement for many tasks such as image-caption retrieval. Previous work on this problem has focused only on finding good vector representations, without any explicit measure of uncertainty. In this work, we argue and demonstrate that learning multimodal representations as probability distributions can lead to better representations, while also providing benefits such as an explicit measure of uncertainty. We show that this uncertainty measure captures how confident the model is about a representation in the multimodal domain, i.e., how easy it is for the model to retrieve or predict the matching pair. We experiment with similarity metrics that have not traditionally been used for the multimodal retrieval task, and show that the choice of similarity metric affects the quality of the learned representations.
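To make the idea concrete, below is a minimal sketch of probabilistic embeddings for cross-modal retrieval: each modality encoder outputs a diagonal Gaussian rather than a point vector, and pairs are scored with a distributional distance. Everything here is an illustrative assumption, not the authors' implementation; the `GaussianHead` module, the feature and embedding dimensions, and the choice of the 2-Wasserstein distance (which has a closed form for diagonal Gaussians) are stand-ins for whatever the paper actually uses.

```python
# Sketch of probabilistic (Gaussian) embeddings for image-caption retrieval.
# All names, dimensions, and the similarity metric are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps a modality-specific feature vector to a Gaussian N(mu, diag(sigma^2))."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.mu = nn.Linear(feat_dim, embed_dim)
        self.log_var = nn.Linear(feat_dim, embed_dim)  # predict log-variance for numerical stability

    def forward(self, x: torch.Tensor):
        mu = self.mu(x)
        sigma = torch.exp(0.5 * self.log_var(x))  # per-dimension standard deviation
        return mu, sigma

def wasserstein2_sq(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal Gaussians:
    ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2."""
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)

# Toy usage: score one image embedding against a batch of five caption embeddings.
image_head, caption_head = GaussianHead(512, 128), GaussianHead(512, 128)
img_mu, img_sigma = image_head(torch.randn(1, 512))
cap_mu, cap_sigma = caption_head(torch.randn(5, 512))
dist = wasserstein2_sq(img_mu, img_sigma, cap_mu, cap_sigma)  # shape (5,)
best_match = dist.argmin()  # smaller distance => more similar pair
# The predicted sigma doubles as an uncertainty estimate: a larger average
# sigma signals the model is less confident about that representation.
```

Under this kind of setup, swapping in a different distributional similarity (e.g., a KL-divergence-based score) only requires replacing `wasserstein2_sq`, which is one way the choice of metric can be studied in isolation.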
