CTAL is a pre-training framework for learning strong audio-and-language representations with a Transformer. It learns intra-modality and inter-modality connections between audio and language through two proxy tasks over a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. The pre-trained model, CTAL (Cross-modal Transformer for Audio and Language), consists of two modules: a language-stream encoding module that takes words as input elements, and a text-referred audio-stream encoder module that accepts both frame-level Mel-spectrograms and token-level output embeddings from the language stream.
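The two-stream design described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: all class names, layer sizes, and the choice of `nn.TransformerDecoder` to realize cross-attention from audio frames to language embeddings are assumptions for clarity.

```python
import torch
import torch.nn as nn


class LanguageStream(nn.Module):
    """Self-attention encoder over word tokens (hypothetical sketch)."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):
        # tokens: (batch, n_tokens) -> (batch, n_tokens, d_model)
        return self.encoder(self.embed(tokens))


class AudioStream(nn.Module):
    """Text-referred audio encoder: self-attention over Mel frames plus
    cross-attention to token-level language-stream embeddings (sketch)."""

    def __init__(self, n_mels=80, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, mel_frames, lang_emb):
        # Queries are audio frames; memory is the language-stream output,
        # so each frame can attend to every token embedding.
        return self.decoder(self.proj(mel_frames), lang_emb)


class CTALSketch(nn.Module):
    """Two streams plus one head per proxy task (assumed layout)."""

    def __init__(self, vocab_size=1000, n_mels=80, d_model=64):
        super().__init__()
        self.lang = LanguageStream(vocab_size, d_model)
        self.audio = AudioStream(n_mels, d_model)
        self.mlm_head = nn.Linear(d_model, vocab_size)  # masked language modeling
        self.mam_head = nn.Linear(d_model, n_mels)      # masked acoustic reconstruction

    def forward(self, tokens, mel_frames):
        lang_emb = self.lang(tokens)
        audio_emb = self.audio(mel_frames, lang_emb)
        return self.mlm_head(lang_emb), self.mam_head(audio_emb)


model = CTALSketch()
tokens = torch.randint(0, 1000, (2, 12))  # (batch, n_tokens)
mels = torch.randn(2, 50, 80)             # (batch, n_frames, n_mels)
mlm_logits, mam_recon = model(tokens, mels)
```

During pre-training, the MLM head would be scored against masked tokens and the acoustic head against masked Mel frames; both losses train the two streams jointly.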
Source: CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
| Task | Papers | Share |
|---|---|---|
| Emotion Classification | 1 | 25.00% |
| Language Modelling | 1 | 25.00% |
| Sentiment Analysis | 1 | 25.00% |
| Speaker Verification | 1 | 25.00% |