OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a new pipeline, OpenIns3D, which requires no 2D image inputs and performs 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to detect objects of interest in them. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which encode the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D-input-free and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Moreover, OpenIns3D allows for effortless switching of 2D detectors without retraining. When integrated with powerful 2D open-world models such as ODISE and GroundingDINO, it achieves excellent results on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries that require intricate reasoning and world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/
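Since the abstract describes the Mask-Snap-Lookup scheme only in prose, here is a minimal sketch of how the three modules could compose. Every name in it (`propose_masks`, `render_snapshots`, `detect_2d`, `lookup`) is a hypothetical placeholder, and the "rendering" and "detection" steps are toy stand-ins; nothing below is the authors' actual implementation or API.

```python
# Hypothetical sketch of the "Mask-Snap-Lookup" control flow.
# All function names are placeholders invented for illustration.

import numpy as np

def propose_masks(points: np.ndarray) -> np.ndarray:
    """'Mask': class-agnostic instance mask proposals over an (N, 3) point cloud.
    Returns a boolean array of shape (num_masks, N). Placeholder: one mask per
    coarse height slice, purely for illustration."""
    z = points[:, 2]
    edges = np.linspace(z.min(), z.max(), 4)
    return np.stack([(z >= lo) & (z < hi) for lo, hi in zip(edges[:-1], edges[1:])])

def render_snapshots(points: np.ndarray, scales=(1.0, 0.5)):
    """'Snap': render synthetic scene-level images at multiple scales and keep,
    for every 3D point, its pixel location in each image (the Mask2Pixel map).
    Rendering is faked here with a top-down orthographic projection."""
    snapshots = []
    for s in scales:
        xy = points[:, :2] * s
        pix = ((xy - xy.min(0)) / (np.ptp(xy, axis=0) + 1e-9) * 255).astype(int)
        image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for a real render
        snapshots.append({"image": image, "mask2pixel": pix})
    return snapshots

def detect_2d(image, vocabulary):
    """Stand-in for any 2D open-vocabulary detector (e.g. ODISE, GroundingDINO).
    Returns (box, label) pairs; this dummy always reports one detection."""
    return [((40, 40, 120, 120), vocabulary[0])]

def lookup(masks, snapshots, vocabulary):
    """'Lookup': assign a category to each 3D mask by checking, via the
    Mask2Pixel maps, which 2D detections its projected pixels fall into."""
    labels = []
    for mask in masks:
        votes = {}
        for snap in snapshots:
            pix = snap["mask2pixel"][mask]  # pixels of this mask in this snapshot
            for (x0, y0, x1, y1), name in detect_2d(snap["image"], vocabulary):
                inside = ((pix[:, 0] >= x0) & (pix[:, 0] <= x1) &
                          (pix[:, 1] >= y0) & (pix[:, 1] <= y1)).sum()
                votes[name] = votes.get(name, 0) + int(inside)
        labels.append(max(votes, key=votes.get) if votes else None)
    return labels

points = np.random.rand(2048, 3)  # toy point cloud
masks = propose_masks(points)
snaps = render_snapshots(points)
print(lookup(masks, snaps, ["chair", "table"]))
```

Because the 2D detector only enters through `detect_2d`, swapping in a different open-vocabulary model changes nothing upstream, which is consistent with the abstract's claim that 2D detectors can be switched without retraining.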
