
Fun3DU: Functionality understanding and segmentation in 3D scenes

TeV - Fondazione Bruno Kessler, University of Trento

Fun3DU teaser

Fun3DU is the first method designed for functionality understanding and segmentation in 3D scenes. Given a high-resolution 3D scene, a set of RGBD views showing the scene, and a description of an action to perform, Fun3DU segments the functional object(s) that can be used to carry out the specified action. Fun3DU relies on pre-trained vision and language models, is training-free, and does not require task-specific annotations.

Abstract

Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. We introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene by using a vision and language model. The segmentation results from each view are lifted to 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches.
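As a rough illustration of the description-parsing step, the sketch below shows how a task description could be sent to a language model with a Chain-of-Thought prompt that names the contextual object and the functional object. The prompt wording, the JSON output schema, and the query_llm callable are illustrative assumptions, not the exact prompt or interface used by Fun3DU.

import json
from typing import Callable

# Illustrative Chain-of-Thought prompt; the real prompt used by Fun3DU may differ.
COT_PROMPT = """You are given a task description for a 3D indoor scene.
Reason step by step:
1. What action does the task require?
2. Which large contextual object does the action involve?
3. Which small functional part (handle, button, knob, ...) must be acted on?
Finally, answer with a JSON object:
{{"contextual_object": "...", "functional_object": "..."}}

Task description: "{description}"
"""

def parse_task_description(description: str, query_llm: Callable[[str], str]) -> dict:
    """Return the contextual and functional object names for a task description.

    `query_llm` is any callable that sends a prompt to a chat-style language
    model and returns its text answer (assumed wrapper, not part of Fun3DU).
    """
    answer = query_llm(COT_PROMPT.format(description=description))
    # The Chain-of-Thought reasoning precedes the final JSON answer; keep only the JSON.
    json_str = answer[answer.rfind("{") : answer.rfind("}") + 1]
    return json.loads(json_str)

# Hypothetical usage: parse_task_description("Open the bottom drawer of the nightstand", my_llm)
# could return {"contextual_object": "nightstand", "functional_object": "drawer handle"}.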

Method

Architecture of Fun3DU
We design Fun3DU as a training-free method that leverages VLMs to comprehend task descriptions and segment functional objects, which are often not explicitly mentioned in the description. Fun3DU is based on four key modules that process multiple views of a given scene and project the results to 3D. The first module interprets the task description, explaining the functionality and context through Chain-of-Thought reasoning. The second module locates contextual objects via open-vocabulary segmentation to improve accuracy and efficiency in masking the functional objects within each view. Moreover, it employs a novel visibility-based view selection approach to reduce the number of views from thousands to a few tens of informative ones. The third module segments the functional objects on this view subset using a 2D VLM. The fourth module performs multi-view agreement by lifting the 2D masks to 3D and aggregating them into the point cloud using point-to-pixel correspondences.
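The sketch below illustrates the multi-view agreement step under simplified assumptions: each 3D point is projected into every selected view with a pinhole camera model, points that land inside that view's 2D functional-object mask receive a vote, and points voted in a sufficient fraction of the views that observe them form the final 3D mask. The function names, the voting threshold, and the omission of depth-based occlusion checks are simplifications for illustration, not Fun3DU's exact implementation.

import numpy as np

def project_points(points_w, K, w2c):
    """Project Nx3 world-frame points to pixel coordinates with a pinhole model."""
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])  # N x 4 homogeneous
    pts_c = (w2c @ pts_h.T).T[:, :3]                            # camera-frame coordinates
    in_front = pts_c[:, 2] > 1e-6                               # discard points behind the camera
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)            # perspective divide
    return uv, in_front

def lift_masks_to_3d(points_w, views, vote_ratio=0.5):
    """Aggregate per-view 2D masks into one binary mask over the point cloud.

    Each element of `views` is a dict (assumed layout) with keys:
      'mask': H x W boolean mask of the functional object in that view,
      'K':    3 x 3 camera intrinsics,
      'w2c':  4 x 4 world-to-camera extrinsics.
    """
    votes = np.zeros(len(points_w), dtype=np.int32)  # views in which a point is masked
    seen = np.zeros(len(points_w), dtype=np.int32)   # views in which a point is visible
    for view in views:
        uv, in_front = project_points(points_w, view["K"], view["w2c"])
        h, w = view["mask"].shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        seen[visible] += 1
        hit = np.zeros(len(points_w), dtype=bool)
        hit[visible] = view["mask"][v[visible], u[visible]]
        votes[hit] += 1
    # Keep points that are masked in at least `vote_ratio` of the views that observe them.
    return (seen > 0) & (votes >= vote_ratio * seen)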

Qualitative results on SceneFun3D

We evaluate Fun3DU on SceneFun3D [1], which comprises 230 scenes with a total of over 3000 task descriptions. We compare against state-of-the-art methods for open-vocabulary 3D scene segmentation: OpenMask3D [2], LERF [3], and OpenIns3D [4]. Fun3DU obtains accurate masks of the functional objects, whereas the other methods tend to segment whole objects.

Related work

[1] Delitzas, Alexandros, et al. "SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Takmaz, Ayca, et al. "OpenMask3D: Open-Vocabulary 3D Instance Segmentation." Advances in Neural Information Processing Systems. 2024.

[3] Kerr, Justin, et al. "LERF: Language Embedded Radiance Fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[4] Huang, Zhening, et al. "OpenIns3D: Snap and Lookup for 3D Open-Vocabulary Instance Segmentation." European Conference on Computer Vision. 2025.

Citation

If you find Fun3DU useful for your work, please cite:
@article{corsetti2024fun3du,
  title={Functionality understanding and segmentation in 3D scenes},
  author={Corsetti, Jaime and Giuliari, Francesco and Fasoli, Alice and Boscaini, Davide and Poiesi, Fabio},
  journal={arXiv preprint arXiv:2411.16310},
  year={2024}
}