-->
Understanding functionalities in 3D scenes involves interpreting natural language descriptions to locate functional interactive objects, such as handles and buttons, in a 3D environment. Functionality understanding is highly challenging, as it requires both world knowledge to interpret language and spatial perception to identify fine-grained objects. We introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. Fun3DU uses a language model to parse the task description through Chain-of-Thought reasoning in order to identify the object of interest. The identified object is segmented across multiple views of the captured scene using a vision and language model. The segmentation results from each view are lifted to 3D and aggregated into the point cloud using geometric information. Fun3DU is training-free, relying entirely on pre-trained models. We evaluate Fun3DU on SceneFun3D, the most recent and only dataset to benchmark this task, which comprises over 3000 task descriptions on 230 scenes. Our method significantly outperforms state-of-the-art open-vocabulary 3D segmentation approaches.
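The lifting-and-aggregation step described above can be illustrated with a minimal sketch. It is not the Fun3DU implementation: it assumes pinhole cameras with known intrinsics and world-to-camera poses, ignores occlusion, and aggregates by simple per-point voting. The function name `lift_masks_to_points`, its parameters, and the `vote_threshold` value are all hypothetical.

```python
import numpy as np


def lift_masks_to_points(points, masks, intrinsics, world_to_cams, vote_threshold=0.5):
    """Project a point cloud into each view and vote on mask membership.

    points:         (N, 3) array of world-space coordinates.
    masks:          list of (H, W) boolean 2D segmentation masks, one per view.
    intrinsics:     list of (3, 3) pinhole camera matrices K.
    world_to_cams:  list of (4, 4) world-to-camera transforms.
    Returns a boolean (N,) 3D mask and the per-point vote score in [0, 1].
    Assumption: occlusion is ignored, so every point that projects inside
    the image counts as visible in that view.
    """
    n = points.shape[0]
    hits = np.zeros(n)  # number of views where the point falls inside the 2D mask
    seen = np.zeros(n)  # number of views where the point falls inside the image
    homog = np.hstack([points, np.ones((n, 1))])

    for mask, K, T in zip(masks, intrinsics, world_to_cams):
        cam = (T @ homog.T).T[:, :3]          # points in camera coordinates
        z = cam[:, 2]
        in_front = z > 1e-6                   # discard points behind the camera
        safe_z = np.where(in_front, z, 1.0)   # avoid division by zero
        pix = (K @ cam.T).T
        u = np.round(pix[:, 0] / safe_z).astype(int)
        v = np.round(pix[:, 1] / safe_z).astype(int)

        h, w = mask.shape
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        seen += visible
        idx = np.flatnonzero(visible)
        hits[idx] += mask[v[idx], u[idx]]

    # Fraction of views that label each point as part of the object.
    score = np.divide(hits, seen, out=np.zeros(n), where=seen > 0)
    return score >= vote_threshold, score
```

A point is kept only if at least `vote_threshold` of the views that see it also mask it, which suppresses spurious detections from individual views. A fuller version would also compare the projected depth against each view's depth map to reject occluded points.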
We evaluate Fun3DU on SceneFun3D [1], which comprises 230 scenes with over 3000 task descriptions in total. We compare against state-of-the-art models for open-vocabulary 3D scene segmentation: OpenMask3D [2], LERF [3], and OpenIns3D [4]. Fun3DU produces accurate masks of the functional objects, whereas the other methods tend to segment whole objects.
[1] Delitzas, Alexandros, et al. "SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Takmaz, Ayca, et al. "OpenMask3D: Open-vocabulary 3D instance segmentation." Advances in Neural Information Processing Systems. 2024.
[3] Kerr, Justin, et al. "LERF: Language embedded radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Huang, Zhening, et al. "OpenIns3D: Snap and lookup for 3D open-vocabulary instance segmentation." European Conference on Computer Vision. 2025.
@article{corsetti2024fun3du,
title={Functionality understanding and segmentation in 3D scenes},
author={Corsetti, Jaime and Giuliari, Francesco and Fasoli, Alice and Boscaini, Davide and Poiesi, Fabio},
journal={arXiv preprint arXiv:2411.16310},
year={2024}
}