SGTR: Generating Scene Graph by Learning Compositional Triplets with Transformer

29 Sep 2021  ·  Rongjie Li, Songyang Zhang, Xuming He ·

In this work, we propose an end-to-end framework for scene graph generation. Motivated by the recently introduced DETR, our method, termed SGTR, generates scene graphs by learning compositional queries with Transformers. We develop a decoding-and-assembling paradigm for end-to-end scene graph generation. On top of a shared backbone, the overall architecture consists of two parallel branches, an entity detector and a triplet constructor, followed by a newly designed assembling mechanism. Specifically, each triplet is constructed from a set of compositional queries in the triplet constructor, where predicate queries and entity queries are learned simultaneously with explicit information exchange. In the training phase, the grouping mechanism is learned by matching the decoded triplets with the outputs of the entity detector. Extensive experiments show that SGTR achieves state-of-the-art performance, surpassing most existing approaches. Moreover, its sparse queries significantly improve the efficiency of scene graph generation. We hope SGTR can serve as a strong baseline for Transformer-based scene graph generation.
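To make the decode-and-assemble idea concrete, here is a minimal toy sketch of the assembling step: each decoded triplet carries subject and object representations, which are linked to the most similar output of the entity detector. This is our own simplification, not the paper's learned grouping mechanism; the function names and the cosine-similarity matching are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assemble_triplets(subj_embs, obj_embs, pred_labels, entity_embs, entity_labels):
    """Link each triplet query's subject/object representation to the most
    similar detected entity, yielding (subject, predicate, object) edges.
    A stand-in for the paper's learned matching, for illustration only."""
    graph = []
    for s, o, p in zip(subj_embs, obj_embs, pred_labels):
        si = max(range(len(entity_embs)), key=lambda i: cosine(s, entity_embs[i]))
        oi = max(range(len(entity_embs)), key=lambda i: cosine(o, entity_embs[i]))
        graph.append((entity_labels[si], p, entity_labels[oi]))
    return graph

# Toy example: two detected entities and one decoded triplet query
entities = [[1.0, 0.0], [0.0, 1.0]]
labels = ["person", "horse"]
triplets = assemble_triplets([[0.9, 0.1]], [[0.1, 0.9]], ["riding"], entities, labels)
print(triplets)  # [('person', 'riding', 'horse')]
```

In SGTR the grouping is learned during training by matching decoded triplets against the entity detector's outputs; this sketch only shows the inference-time pairing shape of that idea.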
