NeurIPS Workshop SVRHM 2020 • George Barnum, Sabera Talukder, Yisong Yue
To facilitate the study of early multimodal fusion, we create a convolutional LSTM network architecture that processes audio and visual inputs simultaneously and lets us select the layer at which the audio and visual information is combined.
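
The idea of a selectable fusion layer can be illustrated with a minimal sketch. This is not the authors' implementation: the names `scale_layer` and `build_fusion_network` are hypothetical, each "layer" is a toy scaling function standing in for a convolutional LSTM layer, and fusion is assumed to be plain concatenation. The sketch only shows how one parameter can set the depth at which two modality-specific streams merge into a shared stream.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def scale_layer(factor: float) -> Layer:
    """Hypothetical stand-in for one network layer (a ConvLSTM layer in the paper)."""
    def apply(x: Vector) -> Vector:
        return [factor * v for v in x]
    return apply


def build_fusion_network(num_layers: int, fusion_layer: int) -> Callable[[Vector, Vector], Vector]:
    """Build a forward function whose streams merge after `fusion_layer` separate layers.

    fusion_layer = 0 fuses the raw inputs (earliest fusion);
    fusion_layer = num_layers fuses only after every layer (latest fusion).
    """
    if not 0 <= fusion_layer <= num_layers:
        raise ValueError("fusion_layer must be between 0 and num_layers")

    # Modality-specific layers run before the fusion point.
    audio_layers = [scale_layer(2.0) for _ in range(fusion_layer)]
    visual_layers = [scale_layer(3.0) for _ in range(fusion_layer)]
    # Shared layers run on the fused representation after the fusion point.
    shared_layers = [scale_layer(0.5) for _ in range(num_layers - fusion_layer)]

    def forward(audio: Vector, visual: Vector) -> Vector:
        for layer in audio_layers:
            audio = layer(audio)
        for layer in visual_layers:
            visual = layer(visual)
        fused = audio + visual  # assumed fusion operator: concatenation
        for layer in shared_layers:
            fused = layer(fused)
        return fused

    return forward


early = build_fusion_network(num_layers=2, fusion_layer=0)
late = build_fusion_network(num_layers=2, fusion_layer=2)
print(early([1.0], [1.0]))  # both shared layers see the fused input
print(late([1.0], [1.0]))   # each stream is processed separately, then fused
```

Because the total depth stays fixed at `num_layers`, changing `fusion_layer` varies only where the modalities meet, which is the kind of controlled comparison the abstract describes.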