InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Pith reviewed 2026-05-10 03:00 UTC · model grok-4.3
The pith
InHabit automatically generates large-scale 3D data of humans interacting with scenes by chaining 2D vision models to propose actions, insert figures, and optimize the results into scene-aligned SMPL-X bodies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
What carries the argument
The render-generate-lift pipeline, which uses a vision-language model for action proposal, an image-editing model for human insertion, and optimization to fit SMPL-X parameters to 3D scene geometry.
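Read as pseudocode, the carrier is a three-stage loop. A minimal sketch follows, in which every callable (render, propose_action, insert_human, fit_smplx) is a hypothetical placeholder standing in for the paper's actual components, not its API:

```python
# A minimal sketch of the render-generate-lift loop as described here; the
# stage implementations are passed in as callables so the skeleton stays
# independent of any particular VLM, editor, or optimizer.
from typing import Any, Callable

def generate_sample(scene: Any, camera: Any,
                    render: Callable, propose_action: Callable,
                    insert_human: Callable, fit_smplx: Callable):
    """One render-generate-lift pass over a single scene viewpoint."""
    image = render(scene, camera)            # render the 3D scene from this viewpoint
    action = propose_action(image)           # VLM proposes a contextually meaningful action
    edited = insert_human(image, action)     # image-editing model inserts a person performing it
    body = fit_smplx(edited, scene, camera)  # optimization lifts the edit to a scene-aligned SMPL-X body
    return edited, body
```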
If this is right
- Augmenting existing training sets with InHabit samples raises accuracy on RGB-based 3D human-scene reconstruction.
- Contact estimation between humans and scenes improves when models are trained with the generated data.
- The new samples are rated higher than state-of-the-art synthetic data in 78 percent of direct perceptual comparisons.
- The pipeline scales automatically to hundreds of large indoor environments without manual annotation.
Where Pith is reading between the lines
- The same pipeline could be run on other large 3D scene collections to expand the range of environments covered.
- Generated interaction data might serve as a starting point for training agents that predict human actions in novel spaces.
- Refinements to the optimization step could further reduce any residual pose artifacts that current models leave behind.
Load-bearing premise
Off-the-shelf vision-language and image-editing models will produce contextually appropriate suggestions and placements that the subsequent optimization can turn into physically plausible 3D bodies without persistent artifacts or impossible configurations.
What would settle it
Generating a fresh batch of samples and counting the fraction of resulting SMPL-X bodies that exhibit interpenetrations with scene geometry or floating contacts after optimization.
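A minimal sketch of such an audit, assuming the scene is available as a watertight trimesh.Trimesh and the body as posed SMPL-X vertices in the scene frame; both thresholds are illustrative tolerances, not values from the paper:

```python
# Flag interpenetration and floating contacts for generated SMPL-X bodies.
import numpy as np
import trimesh

PEN_THRESH = 0.02    # metres of tolerated interpenetration (assumed)
FLOAT_THRESH = 0.05  # max body-to-scene gap before a body counts as floating (assumed)

def check_sample(scene_mesh: trimesh.Trimesh, body_vertices: np.ndarray):
    """Return (penetrating, floating) flags for one body in a scene."""
    # trimesh's signed distance is positive for points inside the mesh, so a
    # large positive maximum means body vertices sit inside scene geometry.
    sd = trimesh.proximity.signed_distance(scene_mesh, body_vertices)
    penetrating = bool(np.max(sd) > PEN_THRESH)
    # If even the nearest body vertex is far from the scene surface, the body
    # has no supporting contact and is flagged as floating.
    floating = bool(np.min(np.abs(sd)) > FLOAT_THRESH)
    return penetrating, floating

def invalid_fraction(samples) -> float:
    """Fraction of (scene_mesh, body_vertices) pairs failing either test."""
    flags = [any(check_sample(mesh, verts)) for mesh, verts in samples]
    return float(np.mean(flags)) if flags else 0.0
```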
Original abstract
Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InHabit, a fully automatic pipeline that applies a render-generate-lift strategy to 3D scenes: a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human figure, and an optimization step lifts the result into physically plausible SMPL-X bodies aligned with scene geometry. Applied to Habitat-Matterport3D, the method generates a dataset of 78K samples across 800 building-scale scenes containing complete 3D geometry, SMPL-X bodies, and RGB images. The authors claim that augmenting standard training data with these samples improves RGB-based 3D human-scene reconstruction and contact estimation, and that the generated data is preferred in 78% of cases over the state of the art in a perceptual user study.
Significance. If the pipeline reliably produces artifact-free, physically plausible samples at the claimed scale, the work would be significant for embodied AI and 3D vision: it offers a scalable route to context-rich human-scene interaction data that leverages commonsense knowledge implicit in internet-scale 2D foundation models, going beyond geometric heuristics or limited mocap. The render-generate-lift paradigm and the resulting dataset size represent a concrete advance in data generation for tasks requiring human-environment understanding.
major comments (3)
- [Abstract and §3] Abstract and pipeline description (§3): The central claim of producing the first large-scale photorealistic 3D HSI dataset of 78K valid samples depends on the render-generate-lift pipeline consistently yielding artifact-free, physically plausible SMPL-X bodies without penetrations, floating, or implausible contacts. No quantitative metrics (e.g., optimization success rate, average penetration depth, contact accuracy, or failure-mode analysis) are reported to rule out that a non-negligible fraction of samples are invalid due to editing artifacts or under-constrained lifts.
- [§5] §5 (downstream experiments): The stated improvements to RGB-based 3D human-scene reconstruction and contact estimation are presented without specific quantitative results, baseline comparisons, ablation tables, or error breakdowns showing the incremental gain from adding InHabit samples versus standard training data alone.
- [User study section] User study (perceptual evaluation): The 78% preference rate is reported without details on study design, number of participants, question phrasing, number of comparisons per participant, or statistical significance testing, making it impossible to assess whether the result robustly supports the data-quality claim.
minor comments (2)
- [§3.3] Clarify the exact objective function, constraints, and convergence criteria used in the SMPL-X optimization step, including any regularization terms for contact and penetration (a sketch of a typical formulation follows this list).
- [§4] Add a table or figure summarizing the distribution of proposed actions, scene categories, and body poses in the final 78K dataset to allow readers to judge diversity.
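For concreteness, a scene-constrained SMPL-X lift of this kind typically minimizes an objective of the following form; this is a sketch in the spirit of standard scene-aware fitting such as [16], not the paper's confirmed formulation:

```latex
E(\theta, \beta, t) =
    \lambda_{\mathrm{2D}}\, E_{\mathrm{2D}}(\theta, \beta, t)
  + \lambda_{\mathrm{pen}}\, E_{\mathrm{pen}}(\theta, \beta, t)
  + \lambda_{\mathrm{contact}}\, E_{\mathrm{contact}}(\theta, \beta, t)
  + \lambda_{\mathrm{prior}}\, E_{\mathrm{prior}}(\theta, \beta)
```

Here θ, β, and t are SMPL-X pose, shape, and global translation; E_2D measures reprojection error against the edited image, E_pen penalizes body-scene interpenetration, E_contact pulls designated contact vertices toward scene surfaces, and E_prior regularizes toward a pose and shape prior. The λ weights are exactly the kind of free parameters the ledger below would need to record.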
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that will strengthen the quantitative support for our claims.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and pipeline description (§3): The central claim of producing the first large-scale photorealistic 3D HSI dataset of 78K valid samples depends on the render-generate-lift pipeline consistently yielding artifact-free, physically plausible SMPL-X bodies without penetrations, floating, or implausible contacts. No quantitative metrics (e.g., optimization success rate, average penetration depth, contact accuracy, or failure-mode analysis) are reported to rule out that a non-negligible fraction of samples are invalid due to editing artifacts or under-constrained lifts.
Authors: We agree that quantitative metrics are essential to validate the pipeline's reliability at scale. The current manuscript relies on qualitative examples and the downstream perceptual study to support the 78K valid samples claim. In the revised version we will add an analysis section reporting optimization success rate, average penetration depth (using standard SMPL-X penetration metrics), contact accuracy where ground truth is available, and a summary of failure modes and filtering steps applied to reach the final dataset size. revision: yes
-
Referee: [§5] §5 (downstream experiments): The stated improvements to RGB-based 3D human-scene reconstruction and contact estimation are presented without specific quantitative results, baseline comparisons, ablation tables, or error breakdowns showing the incremental gain from adding InHabit samples versus standard training data alone.
Authors: The manuscript states that augmenting standard training data with InHabit samples improves reconstruction and contact estimation, but we acknowledge the lack of detailed quantitative breakdowns. We will revise §5 to include explicit numerical results (e.g., MPJPE, contact F1, or equivalent metrics), direct comparisons against baselines trained without InHabit data, ablation tables isolating the contribution of the new samples, and per-scene or per-category error analyses. revision: yes
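As a concrete reading of the promised metrics, a minimal sketch of MPJPE and contact-F1 computations, assuming predicted and ground-truth joints are (N, J, 3) arrays in metres and contact labels are boolean per-vertex arrays; this mirrors common practice, not the paper's exact protocol:

```python
import numpy as np

def mpjpe(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean per-joint position error, reported in millimetres."""
    err = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (N, J) distances
    return float(err.mean() * 1000.0)

def contact_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1 score for binary per-vertex contact predictions."""
    tp = int(np.sum(pred & gt))
    precision = tp / max(int(np.sum(pred)), 1)
    recall = tp / max(int(np.sum(gt)), 1)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```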
-
Referee: [User study section] User study (perceptual evaluation): The 78% preference rate is reported without details on study design, number of participants, question phrasing, number of comparisons per participant, or statistical significance testing, making it impossible to assess whether the result robustly supports the data-quality claim.
Authors: We agree that the user-study description is insufficient for reproducibility and credibility assessment. In the revision we will expand the user-study section with the exact protocol: number of participants, recruitment method, precise question wording and interface, number of pairwise comparisons shown per participant, randomization procedure, and any statistical significance tests (e.g., binomial test or p-values) supporting the 78% preference. revision: yes
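A minimal sketch of the binomial test mentioned above, assuming each pairwise comparison is an independent Bernoulli trial; the counts are hypothetical stand-ins for the study's real numbers:

```python
from scipy.stats import binomtest

n_comparisons = 1000  # hypothetical total pairwise comparisons
n_preferred = 780     # hypothetical wins for InHabit data (78% preference)

# Two-sided test against the chance-level null hypothesis p = 0.5.
result = binomtest(n_preferred, n_comparisons, p=0.5, alternative="two-sided")
print(f"preference = {n_preferred / n_comparisons:.0%}, p = {result.pvalue:.3g}")
```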
Circularity Check
No circularity: pipeline uses external models to generate independent dataset
Full rationale
The derivation chain consists of applying off-the-shelf VLMs and image-editing models (external to the paper) followed by an optimization lift to SMPL-X, then empirically evaluating the resulting 78K-sample dataset via augmentation experiments and a separate perceptual study. No equations, parameters, or claims reduce by construction to fitted inputs, self-definitions, or author self-citations; the central results are new data plus downstream measurements that are falsifiable outside the generation process itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions.
invented entities (1)
- InHabit render-generate-lift pipeline (no independent evidence)
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., et al.: GPT-4 technical report. arXiv:2303.08774 (2023)
- [2] Araujo, J.P., Li, J., Vetrivel, K., Agarwal, R., Gopinath, D., Wu, J., Clegg, A., Liu, C.K.: CIRCLE: Capture in rich contextual environments. In: CVPR (2023)
- [3] Baradel, F., Armando, M., Galaaoui, S., Brégier, R., Weinzaepfel, P., Rogez, G., Lucas, T.: Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. In: ECCV (2024)
- [4] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: CVPR (2023)
- [5] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: CVPR (2023)
- [6] Cao, Z., Gao, H., Mangalam, K., Cai, Q.Z., Vo, M., Malik, J.: Long-term human motion prediction with scene context. In: ECCV (2020)
- [7] Chen, Y., Chen, X., Xue, Y., Chen, A., Xiu, Y., Pons-Moll, G.: Human3R: Everyone everywhere all at once. In: ICLR (2026)
- [8] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017)
- [9] Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: PoseScript: 3D human poses from natural language. In: ECCV (2022)
- [10] Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G.: PoseEmbroider: Towards a 3D, visual, semantic-aware human pose representation. In: ECCV (2024)
- [11] Ge, Y., Wang, W., Chen, Y., Chen, H., Shen, C.: 3D human reconstruction in the wild with synthetic data using generative models. TPAMI (2024)
- [12] Google DeepMind: Gemini 2.5: Our most intelligent AI model. arXiv:2507.06261 (2025)
- [13] Google DeepMind: Gemini 3. https://deepmind.google/models/gemini/ (2025)
- [14] Guzov, V., Chibane, J., Marin, R., He, Y., Saracoglu, Y., Sattler, T., Pons-Moll, G.: Interaction Replica: Tracking human-object interaction and scene changes from human motion. In: 3DV (2024)
- [15] Guzov, V., Mir, A., Sattler, T., Pons-Moll, G.: Human POSEitioning System (HPS): 3D human pose estimation and self-localization in large scenes from body-mounted sensors. In: CVPR (2021)
- [16] Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3D human pose ambiguities with 3D scene constraints. In: ICCV (2019)
- [17] Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M.J.: Populating 3D scenes by learning human-scene interaction. In: CVPR (Jun 2021)
- [18] Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: CVPR (2022)
- [19] Jiang, N., Zhang, Z., Li, H., Ma, X., Wang, Z., Chen, Y., Liu, T., Zhu, Y., Huang, S.: Scaling up dynamic human-scene interaction modeling. In: CVPR (2024)
- [20] Kim, H., Baik, S., Joo, H.: DAViD: Modeling dynamic affordance of 3D objects using pre-trained video diffusion models. In: ICCV (2025)
- [21] Kim, H., Han, S., Kwon, P., Joo, H.: Beyond the contact: Discovering comprehensive affordance for 3D objects from pre-trained 2D diffusion models. In: ECCV (2024)
- [22] Kister, N., Sárándi, I., Khoreva, A., Pons-Moll, G.: Are pose estimators ready for the open world? STAGE: Synthetic data generation toolkit for auditing 3D human pose estimators. In: 3DV (2026)
- [23] Kulal, S., Brooks, T., Aiken, A., Wu, J., Yang, J., Lu, J., Efros, A.A., Singh, K.K.: Putting people in their place: Affordance-aware human insertion into scenes. In: CVPR (2023)
- [24] Li, L., Dai, A.: GenZI: Zero-shot 3D human-scene interaction generation. In: CVPR (Jun 2024)
- [25] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: ICCV (2019)
- [26] Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: Avatars in geography optimized for regression analysis. In: CVPR (Jun 2021)
- [27] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (Jun 2019)
- [28] Puig, X., Undersander, E., Szot, A., Cote, M.D., Batra, D., Mottaghi, R.: Habitat 3.0: A co-habitat for humans, avatars and robots. In: ICLR (2024)
- [29] Ramakrishnan, S.K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J.M., Undersander, E., Galuba, W., Westbury, A., Chang, A.X., Savva, M., Zhao, Y., Batra, D.: Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In: NeurIPS Datasets and Benchmarks (2021), https://arxiv.org/abs/2109.08238
- [30] Savva, M., Chang, A.X., Hanrahan, P., Fisher, M., Nießner, M.: PiGraphs: Learning interaction snapshots from observations. ACM Transactions on Graphics (TOG) 35(4) (2016)
- [31] Tripathi, S., Chatterjee, A., Passy, J.C., Yi, H., Tzionas, D., Black, M.J.: DECO: Dense estimation of 3D human-scene contact in the wild. In: ICCV. pp. 8001–8013 (Oct 2023)
- [32] Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3D perception model with persistent state. In: CVPR (2025)
- [33] Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., Yang, J.: MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In: CVPR (2025)
- [34] Wang, T., Mao, X., Zhu, C., Xu, R., Lyu, R., Li, P., Chen, X., Zhang, W., Chen, K., Xue, T., et al.: EmbodiedScan: A holistic multi-modal 3D perception suite towards embodied AI. In: CVPR (2024)
- [35] Wang, Z., Chen, Y., Liu, T., Zhu, Y., Liang, W., Huang, S.: HUMANISE: Language-conditioned human motion generation in 3D scenes. In: NeurIPS (2022)
- [36] Yalandur Muralidhar, P., Xue, Y., Xie, X., Kostyrko, M., Pons-Moll, G.: Physic: Physically plausible 3D human-scene interaction and contact from a single image. In: TOG (Proc. SIGGRAPH Asia) (2025)
- [37] YM, P., Xue, Y., Chen, Y., Kister, N., Sárándi, I., Pons-Moll, G.: GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction. Project website, https://pradyumnaym.github.io/graft/ (2026)
- [38] Zhang, S., Zhang, Y., Ma, Q., Black, M.J., Tang, S.: PLACE: Proximity learning of articulation and contact in 3D environments. In: 3DV. pp. 642–651. IEEE (2020)
- [39] Zhang, X., Bhatnagar, B.L., Guzov, V., Starke, S., Pons-Moll, G.: COUCH: Towards controllable human-chair interactions. In: ECCV (2022)
- [40] Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., Gan, C.: 3D-VLA: A 3D vision-language-action generative world model. In: ICML (2024)