Disentangled Mask Attention in Transformer

29 Sep 2021 · Jen-Tzung Chien, Yu-Han Huang

The transformer conducts self-attention, which has achieved state-of-the-art performance in many applications. Multi-head attention in the transformer basically gathers the features of individual tokens in the input sequence to form the mapping to the output sequence. There are two weaknesses in learning representations with the transformer. First, because the attention mechanism naturally mixes the features of different tokens across the input and output sequences, the representations of input tokens are likely to contain redundant information. Second, the patterns of attention weights across different heads tend to be similar, so the representation capacity of the model may be bounded. To strengthen the sequential learning representation, this paper presents a new disentangled mask attention in the transformer, in which redundant features are reduced and semantic information is enriched. Latent disentanglement in multi-head attention is learned. The attention weights are filtered by a mask that is optimized through semantic clustering. The proposed attention mechanism is implemented for sequential learning according to the clustered disentanglement objective. Experiments on machine translation show the merit of the disentangled transformer in sequence-to-sequence learning tasks.
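The sketch below illustrates the general idea of filtering attention weights with a mask derived from semantic clustering of the tokens; it is not the paper's implementation. The k-means clustering step, the `num_clusters` parameter, and the hard same-cluster mask are assumptions for illustration, since the abstract does not specify the exact form of the clustered disentanglement objective.

```python
# Minimal sketch (not the official implementation): single-head attention whose
# weights are filtered by a mask obtained from clustering the key tokens.
import torch
import torch.nn.functional as F


def cluster_mask(keys: torch.Tensor, num_clusters: int, iters: int = 10) -> torch.Tensor:
    """Assign each key token to a cluster with a simple k-means pass and
    return a (T, T) mask that keeps attention only within the same cluster."""
    T, d = keys.shape
    centroids = keys[torch.randperm(T)[:num_clusters]].clone()
    for _ in range(iters):
        dists = torch.cdist(keys, centroids)          # (T, K) token-to-centroid distances
        assign = dists.argmin(dim=-1)                 # (T,) cluster assignment per token
        for k in range(num_clusters):
            members = keys[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(dim=0)
    # Tokens attend only to tokens in the same semantic cluster.
    return (assign.unsqueeze(0) == assign.unsqueeze(1)).float()


def disentangled_mask_attention(q, k, v, num_clusters=4):
    """Scaled dot-product attention whose weights are filtered by the cluster mask."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (T, T) raw attention scores
    mask = cluster_mask(k, num_clusters)              # (T, T) semantic mask
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # filtered attention weights
    return weights @ v


# Toy usage: 8 tokens with 16-dimensional features.
x = torch.randn(8, 16)
out = disentangled_mask_attention(x, x, x, num_clusters=2)
print(out.shape)  # torch.Size([8, 16])
```

In this sketch the mask is recomputed from the current features rather than learned jointly with a disentanglement objective, which is where the paper's method would differ.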
