Open-Vocabulary Functional 3D Human-Scene Interaction Generation
FunHSI is a training-free framework that generates physically plausible and functionally correct 3D human-scene interactions from posed RGB-D observations and open-vocabulary task prompts.
Abstract
Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. We propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. FunHSI supports both general interactions (e.g., “sitting on a sofa”) and fine-grained functional interactions (e.g., “increasing the room temperature”), and consistently generates functionally correct and physically plausible interactions across diverse indoor and outdoor scenes.
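The abstract says that high-level interactions are modeled via a contact graph but does not describe its structure. The following is a minimal Python sketch of one way such a graph could be represented, as a list of body-part-to-scene-element contacts for a task prompt. The class names, field names, and labels (`Contact`, `ContactGraph`, `right_hand`, `thermostat_dial`) are hypothetical and chosen only for illustration; this is not FunHSI's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Contact:
    """One edge of the graph: a body part touching a scene element."""
    body_part: str       # hypothetical label, e.g. "right_hand"
    scene_element: str   # hypothetical label, e.g. "thermostat_dial"
    functional: bool     # True if this contact is what accomplishes the task


@dataclass
class ContactGraph:
    """Illustrative container for the contacts implied by one task prompt."""
    task: str
    contacts: List[Contact] = field(default_factory=list)

    def add(self, body_part: str, scene_element: str, functional: bool = False) -> None:
        self.contacts.append(Contact(body_part, scene_element, functional))

    def functional_elements(self) -> List[str]:
        # Scene elements that must be touched for the task to succeed.
        return sorted({c.scene_element for c in self.contacts if c.functional})

    def parts_touching(self, scene_element: str) -> List[str]:
        # Body parts in contact with a given scene element.
        return sorted(c.body_part for c in self.contacts
                      if c.scene_element == scene_element)


# Assumed example contacts for the prompt quoted in the abstract.
graph = ContactGraph(task="increasing the room temperature")
graph.add("right_hand", "thermostat_dial", functional=True)
graph.add("left_foot", "floor")
graph.add("right_foot", "floor")
```

In this sketch, separating functional contacts (the hand on the dial) from supporting contacts (the feet on the floor) mirrors the abstract's distinction between functional correctness and physical plausibility; how FunHSI actually encodes that distinction is not stated on this page.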
Video
How It Works
General Human-Scene Interaction
Squatting in front of the washing machine
Walking in front of the left wooden cabinet
Functional Human-Scene Interaction
Adjusting the temperature
Opening the bottom drawer of the leftmost wooden cabinet with the books on top
Opening the window to the left of the couch
Adjusting the room's temperature using the dial next to the door
Switching to a station on the radio on the bedside table
Opening the top drawer of the wooden nightstand to the left of the bed
Human-Scene Interaction in Real-world City Scenes
(These scenes were captured in Munich using an iPhone 14 Pro Max.)
Buying a parking ticket
Decorating the Christmas tree
Pinning a paper to the whiteboard
Buying a metro ticket
Opening the emergency door
Sitting on a bench
Citation
If you find this work useful, please consider citing:
@misc{liu2026openvocabularyfunctional3dhumanscene,
title={Open-Vocabulary Functional 3D Human-Scene Interaction Generation},
author={Jie Liu and Yu Sun and Alpar Cseke and Yao Feng and Nicolas Heron and Michael J. Black and Yan Zhang},
year={2026},
eprint={2601.20835},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.20835},
}
Acknowledgements
We sincerely thank Alexandros Delitzas and Francis Engelmann for guidance on SceneFun3D; Priyanka Patel for guidance on CameraHMR; and Muhammed Kocabas for fruitful discussions on foundation models. We also thank Nitin Saini and Nathan Bajandas for help with Unreal Engine. This work was done while Jie Liu was an intern at Meshcapade.
Last updated: January 2026