Sparsifying Transformer Models with Trainable Representation Pooling
We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations during the training process, thus focusing on the task-specific parts of an input. A robust trainable top-$k$ operator reduces the quadratic time and memory complexity to sublinear. Our experiments on a challenging long-document summarization task show that even our simple baseline performs comparably to the current SOTA, and that with trainable pooling we can retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
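To make the idea of a trainable top-$k$ pooling layer concrete, here is a minimal PyTorch sketch of one way such an operator could be built: a learned scorer ranks token representations, only the $k$ highest-scoring tokens are kept for downstream attention, and a straight-through relaxation lets gradients reach the scorer. The class name, the straight-through trick, and all parameter choices are illustrative assumptions for this sketch, not the paper's actual operator.

```python
import torch
import torch.nn as nn


class TrainableTopKPooling(nn.Module):
    """Keep the k highest-scoring token representations (hypothetical sketch).

    Forward pass uses a hard top-k selection; the backward pass routes
    gradients to the scorer through the soft selection weights
    (straight-through estimator).
    """

    def __init__(self, hidden_size: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        scores = self.scorer(hidden_states).squeeze(-1)        # (batch, seq_len)
        soft = torch.sigmoid(scores)                           # soft selection weights
        topk = scores.topk(self.k, dim=-1).indices             # indices of kept tokens
        hard = torch.zeros_like(soft).scatter(-1, topk, 1.0)   # hard 0/1 mask
        mask = hard + soft - soft.detach()                     # straight-through gradients
        weighted = hidden_states * mask.unsqueeze(-1)          # weight token representations
        # Gather only the selected tokens (in original order), so any attention
        # applied afterwards costs O(k^2) instead of O(n^2).
        idx = topk.sort(dim=-1).values.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        return weighted.gather(1, idx)                         # (batch, k, hidden_size)


if __name__ == "__main__":
    pooling = TrainableTopKPooling(hidden_size=64, k=16)
    x = torch.randn(2, 128, 64, requires_grad=True)
    out = pooling(x)
    print(out.shape)          # torch.Size([2, 16, 64])
    out.sum().backward()      # gradients flow back to the scorer via the soft scores
```

Because the output length is a fixed $k$ regardless of the input length, stacking such pooling layers between encoder blocks shrinks the sequence progressively, which is what makes the overall complexity sublinear in the input length.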
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Document Summarization | Arxiv HEP-TH citation graph | DeepPyramidion | ROUGE-1 | 47.15 | #1
Text Summarization | Arxiv HEP-TH citation graph | Blockwise (baseline) | ROUGE-1 | 46.85 | #12
Text Summarization | Arxiv HEP-TH citation graph | Blockwise (baseline) | ROUGE-2 | 19.39 | #11
Text Summarization | Arxiv HEP-TH citation graph | DeepPyramidion | ROUGE-1 | 47.15 | #11
Text Summarization | Arxiv HEP-TH citation graph | DeepPyramidion | ROUGE-2 | 19.99 | #10
Document Summarization | arXiv Summarization Dataset | DeepPyramidion | ROUGE-2 | 19.99 | #1