Paper tables with annotated results for GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language

Paper

GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language

One of the essential human skills is the ability to seamlessly build an inner representation of the world. By exploiting this representation, humans can easily find consensus between visual, auditory and linguistic perspectives. In this work, we set out to understand and emulate this ability through an explicit representation for both vision and language - Graphs of Events in Space and Time (GEST). GEST allows us to measure the similarity between texts and videos in a semantic and fully explainable way, through graph matching. It also allows us to generate text and videos from a common representation whose content is well understood. We show that graph matching similarity metrics based on GEST outperform classical text generation metrics and can also boost the performance of state-of-the-art, heavily trained metrics.
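To make the idea of comparing a text and a video through graph matching more concrete, here is a minimal toy sketch in Python. It is not the method used in the paper: the `Event` fields, the greedy one-to-one node matching, the edge labels, and the function name `toy_graph_similarity` are all assumptions chosen for illustration. Each graph is a list of events plus a set of labelled edges, and the score rewards matched events and the relations preserved between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event attributes; the paper's node content may differ.
    actor: str
    action: str
    obj: str = ""


def node_score(a: Event, b: Event) -> float:
    """Fraction of attributes (actor, action, object) on which two events agree."""
    fields = [(a.actor, b.actor), (a.action, b.action), (a.obj, b.obj)]
    return sum(x == y for x, y in fields) / len(fields)


def toy_graph_similarity(events_a, edges_a, events_b, edges_b) -> float:
    """Greedy one-to-one matching of events, then count preserved edges.

    Edges are (source_index, target_index, relation) triples.
    The result lies in [0, 1]; identical graphs score 1.0.
    """
    candidates = sorted(
        ((node_score(a, b), i, j)
         for i, a in enumerate(events_a)
         for j, b in enumerate(events_b)),
        reverse=True,
    )
    mapping, used_b, node_total = {}, set(), 0.0
    for score, i, j in candidates:
        if score > 0 and i not in mapping and j not in used_b:
            mapping[i] = j
            used_b.add(j)
            node_total += score

    # An edge is preserved if both endpoints are matched and the same
    # relation links their images in the other graph.
    edge_total = sum(
        1 for (i, j, rel) in edges_a
        if i in mapping and j in mapping
        and (mapping[i], mapping[j], rel) in edges_b
    )
    denom = max(len(events_a) + len(edges_a), len(events_b) + len(edges_b))
    return (node_total + edge_total) / denom if denom else 1.0


# Toy graphs: one built from a sentence, one from a video description.
text_events = [Event("man", "open", "door"), Event("man", "sit", "chair")]
video_events = [Event("man", "open", "door"), Event("man", "sit", "sofa")]
edges = {(0, 1, "next")}

similarity = toy_graph_similarity(text_events, edges, video_events, edges)
print(round(similarity, 3))
```

Because the score is assembled from explicit node and edge matches, the `mapping` built inside the function shows which events were paired, which is the sense in which such a comparison can be inspected rather than treated as a black box.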


