MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer
As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task: given a multi-sentence story plot and an ending-related image, generate a story ending. Unlike existing image captioning or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. Existing methods for IgSEG ignore the relationships between the modalities and do not integrate multimodal features appropriately. In this work, we therefore propose the Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses contextual and visual information to capture the multimodal dependency for IgSEG. First, we extract textual and visual features separately using modality-specific large-scale pretrained encoders. Second, we use a memory-augmented cross-modal attention network to learn cross-modal relationships and perform fine-grained feature fusion. Finally, a multimodal transformer decoder attends over the multimodal features to learn the story dependency and generates informative, reasonable, and coherent story endings. In experiments, extensive automatic and human evaluation results show that MMT significantly outperforms state-of-the-art methods on two benchmark datasets.
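The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that "memory-augmented" means the visual keys and values are extended with extra learnable memory slots before standard scaled dot-product attention, with text features acting as queries. The function name `memory_cross_attention`, the slot count, and all dimensions are hypothetical, and random arrays stand in for pretrained encoder outputs and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_cross_attention(text, visual, mem_k, mem_v, w_q, w_k, w_v):
    """Text queries attend over visual features plus memory slots.

    Assumption (not confirmed by the abstract): memory slots are
    concatenated to the projected visual keys and values.
    """
    q = text @ w_q                                      # (T, d)
    k = np.concatenate([visual @ w_k, mem_k], axis=0)   # (V + M, d)
    v = np.concatenate([visual @ w_v, mem_v], axis=0)   # (V + M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (T, V + M)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 5 text tokens, 3 image regions, 2 memory slots.
rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))     # stand-in for text encoder output
visual = rng.normal(size=(3, d))   # stand-in for visual encoder output
mem_k = rng.normal(size=(2, d))    # memory keys (learned in a real model)
mem_v = rng.normal(size=(2, d))    # memory values (learned in a real model)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fused, weights = memory_cross_attention(text, visual, mem_k, mem_v, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Under this assumption, each text token receives a weighted mix of image-region features and memory values, so the memory slots can carry information that is not present in the current image.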
Results from the Paper
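Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Image-guided Story Ending Generation | LSMDC-E | MMT | BLEU-1 | 18.52 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | BLEU-2 | 5.99 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | BLEU-3 | 2.51 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | BLEU-4 | 1.13 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | METEOR | 12.87 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | CIDEr | 12.41 | # 1 |
Image-guided Story Ending Generation | LSMDC-E | MMT | ROUGE-L | 20.99 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | BLEU-1 | 22.87 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | BLEU-2 | 8.68 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | BLEU-3 | 4.38 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | BLEU-4 | 2.61 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | METEOR | 15.55 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | CIDEr | 25.41 | # 1 |
Image-guided Story Ending Generation | VIST-E | MMT | ROUGE-L | 23.61 | # 1 |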