VideoXum is an enriched large-scale dataset for cross-modal video summarization, built on ActivityNet Captions. The dataset includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).