CTAL is a pre-training framework for learning strong audio-and-language representations with a Transformer. It learns intra-modality and inter-modality connections between audio and language through two proxy tasks over a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model, CTAL (Cross-modal Transformer for Audio and Language), consists of two modules: a language-stream encoding module that takes words as input elements, and a text-referred audio-stream encoder module that accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream.
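The two-stream design described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: all class names, layer sizes, and the choice of `nn.TransformerDecoder` to realize cross-attention from audio frames to language embeddings are assumptions for clarity.

```python
import torch
import torch.nn as nn


class LanguageStream(nn.Module):
    """Self-attention encoder over word tokens (hypothetical sketch)."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):
        # tokens: (batch, n_tokens) -> (batch, n_tokens, d_model)
        return self.encoder(self.embed(tokens))


class AudioStream(nn.Module):
    """Text-referred audio encoder: self-attention over Mel frames plus
    cross-attention to token-level language-stream embeddings (sketch)."""

    def __init__(self, n_mels=80, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, mel_frames, lang_emb):
        # Queries are audio frames; memory is the language-stream output,
        # so each frame can attend to every token embedding.
        return self.decoder(self.proj(mel_frames), lang_emb)


class CTALSketch(nn.Module):
    """Two streams plus one head per proxy task (assumed layout)."""

    def __init__(self, vocab_size=1000, n_mels=80, d_model=64):
        super().__init__()
        self.lang = LanguageStream(vocab_size, d_model)
        self.audio = AudioStream(n_mels, d_model)
        self.mlm_head = nn.Linear(d_model, vocab_size)  # masked language modeling
        self.mam_head = nn.Linear(d_model, n_mels)      # masked acoustic reconstruction

    def forward(self, tokens, mel_frames):
        lang_emb = self.lang(tokens)
        audio_emb = self.audio(mel_frames, lang_emb)
        return self.mlm_head(lang_emb), self.mam_head(audio_emb)


model = CTALSketch()
tokens = torch.randint(0, 1000, (2, 12))  # (batch, n_tokens)
mels = torch.randn(2, 50, 80)             # (batch, n_frames, n_mels)
mlm_logits, mam_recon = model(tokens, mels)
```

During pre-training, the MLM head would be scored against masked tokens and the acoustic head against masked Mel frames; both losses train the two streams jointly.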
Source: CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
| Task | Papers | Share |
|---|---|---|
| Emotion Classification | 1 | 25.00% |
| Language Modelling | 1 | 25.00% |
| Sentiment Analysis | 1 | 25.00% |
| Speaker Verification | 1 | 25.00% |