Open-Vocabulary Functional 3D Human-Scene Interaction Generation

Jie Liu1,2, Yu Sun1, Alpár Cseke3, Yao Feng4, Nicolas Heron1, Michael J. Black3, Yan Zhang1

1Meshcapade 2University of Amsterdam
3Max Planck Institute for Intelligent Systems 4Stanford University

FunHSI is a training-free framework that generates physically plausible and functionally correct 3D human–scene interactions from posed RGB-D observations and open-vocabulary task prompts.

FunHSI demo
FunHSI. Demo video showcasing functionality-aware 3D human–scene interactions across diverse scenes.

Abstract

Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. We propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. FunHSI supports both general interactions (e.g., “sitting on a sofa”) and fine-grained functional interactions (e.g., “increasing the room temperature”), and consistently generates functionally correct and physically plausible interactions across diverse indoor and outdoor scenes.

Video

How It Works

FunHSI method overview
FunHSI takes a set of posed RGB-D images and an open-vocabulary task prompt as input, and generates a 3D human that accomplishes the specified task via functionally correct contact with the scene. The pipeline consists of three modules: (i) Functionality-aware Contact Reasoning, (ii) Contact-aware Body Initialization, and (iii) Two-stage Body Refinement.
1. Functionality-aware Contact Reasoning
This module identifies task-relevant functional elements in the scene, reconstructs their 3D geometry, and performs contact-graph reasoning to produce high-level interactions.
2. Contact-aware Body Initialization
This module leverages the inferred functional elements and contact relations to synthesize a human performing the task in the image, and estimates the 3D body and hand poses.
3. Two-stage Body Refinement
This module places the initialized 3D body into the 3D scene and performs stage-wise optimization to refine the body pose, hand poses, and human-scene contacts.
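The three modules above can be sketched as a minimal data flow. This is an illustrative toy, not the authors' implementation: the class names (`FunctionalElement`, `ContactGraph`), the keyword-matching "reasoning", and the averaging "refinement" are all stand-ins for the paper's VLM-based reasoning and stage-wise optimization.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalElement:
    label: str       # e.g. "thermostat dial" (hypothetical example)
    centroid: tuple  # 3D position of the reconstructed element

@dataclass
class ContactGraph:
    # Maps a body part (e.g. "right_hand") to the element it should touch.
    edges: dict = field(default_factory=dict)

def reason_contacts(task_prompt: str, elements: list) -> ContactGraph:
    """Toy stand-in for functionality-aware contact reasoning: pick the
    element whose label overlaps the prompt and assign the right hand."""
    graph = ContactGraph()
    prompt = task_prompt.lower()
    for el in elements:
        if any(word in prompt for word in el.label.lower().split()):
            graph.edges["right_hand"] = el
            break
    return graph

def refine_body(init_pose: dict, graph: ContactGraph, steps: int = 2) -> dict:
    """Toy stand-in for stage-wise refinement: repeatedly move each
    contacted joint halfway toward its target element."""
    pose = dict(init_pose)
    for _ in range(steps):
        for part, el in graph.edges.items():
            pose[part] = tuple((c + t) / 2
                               for c, t in zip(pose[part], el.centroid))
    return pose

# Usage sketch: a prompt that mentions a "dial" binds the right hand to it,
# then refinement pulls the hand toward the dial's centroid.
elements = [FunctionalElement("thermostat dial", (1.0, 1.5, 0.2))]
graph = reason_contacts("Adjusting the temperature using the dial", elements)
pose = refine_body({"right_hand": (0.0, 0.0, 0.0)}, graph)
```

In the real pipeline, the contact graph is inferred by vision-language models over reconstructed 3D geometry, and refinement optimizes full SMPL-X-style body and hand parameters under physical-plausibility constraints rather than snapping joint positions.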

General Human-Scene Interaction

Squatting in front of washing machine

Walking in front of the left wooden cabinet

Functional Human-Scene Interaction

Adjusting the temperature

Opening the bottom drawer of the leftmost wooden cabinet with the books on top

Opening the window to the left of the couch

Adjusting the room's temperature using the dial next to the door

Switching to a station on the radio on the bedside table

Opening the top drawer of the wooden nightstand to the left of the bed

Human-Scene Interaction in Real-world City Scenes

(These scenes were captured in Munich using an iPhone 14 Pro Max.)

Buying a parking ticket

Decorating the Christmas tree

Pinning a paper to the whiteboard

Buying a metro ticket

Opening the emergency door

Sitting on a bench

Citation

If you find this work useful, please consider citing:

@misc{liu2026openvocabularyfunctional3dhumanscene,
  title={Open-Vocabulary Functional 3D Human-Scene Interaction Generation},
  author={Jie Liu and Yu Sun and Alpar Cseke and Yao Feng and Nicolas Heron and Michael J. Black and Yan Zhang},
  year={2026},
  eprint={2601.20835},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.20835},
}

Acknowledgements

We sincerely thank Alexandros Delitzas and Francis Engelmann for guidance on SceneFun3D; Priyanka Patel for guidance on CameraHMR; and Muhammed Kocabas for fruitful discussions on foundation models. We also thank Nitin Saini and Nathan Bajandas for help with Unreal Engine. This work was done when Jie Liu was an intern at Meshcapade.

Last updated: January 2026