InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Anna Khoreva; Gerard Pons-Moll; István Sárándi; Jiayi Wang; Nikita Kister; Pradyumna YM

arxiv: 2604.19673 · v2 · pith:2S42JA4Vnew · submitted 2026-04-21 · 💻 cs.CV

Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

Pith reviewed 2026-05-10 03:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: human-scene interaction · 3D dataset generation · foundation models · SMPL-X bodies · photorealistic synthesis · embodied AI · scene population

The pith

InHabit automatically generates large-scale 3D data of humans interacting with scenes by chaining 2D vision models to propose actions, insert figures, and optimize the results into scene-aligned SMPL-X bodies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that knowledge embedded in 2D foundation models can be transferred to create realistic 3D human-scene interaction data at scale. Current datasets for training embodied agents are either small and expensive or rely on crude geometric rules that miss real-world context. InHabit renders a 3D scene, uses a vision-language model to suggest plausible actions, edits the image to add a person, and runs an optimization to lift the edit into a 3D body model that respects scene geometry. The resulting dataset contains 78,000 samples across 800 building-scale environments. When mixed into standard training sets, this data raises accuracy on 3D human reconstruction and contact prediction tasks, and human raters favor it over prior synthetic examples in 78 percent of cases.

Core claim

InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over prior art.

What carries the argument

The render-generate-lift pipeline, which uses a vision-language model for action proposal, an image-editing model for human insertion, and optimization to fit SMPL-X parameters to 3D scene geometry.
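As a rough illustration of how the three stages compose, here is a minimal Python sketch. The function names (`propose_actions`, `insert_human`, `lift_to_smplx`), the `Placement` record, and the stub stages are hypothetical placeholders, not the authors' code or any real library's API. Dropping samples whose lift fails is also an assumption made for the sketch; the paper's actual models, optimizer, and filtering are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Placement:
    """One generated sample: the action text and the lifted body parameters."""
    action: str
    smplx_params: dict  # hypothetical container for pose/shape/translation


def render_generate_lift(
    rendered_view: str,
    propose_actions: Callable[[str], List[str]],
    insert_human: Callable[[str, str], str],
    lift_to_smplx: Callable[[str, str], Optional[dict]],
) -> List[Placement]:
    """Chain the three stages; drop samples whose lift step returns None."""
    placements = []
    for action in propose_actions(rendered_view):      # vision-language stage
        edited = insert_human(rendered_view, action)   # image-editing stage
        params = lift_to_smplx(edited, rendered_view)  # optimization stage
        if params is not None:
            placements.append(Placement(action, params))
    return placements


# Stub stages standing in for real models, for illustration only.
def stub_propose(view: str) -> List[str]:
    return ["sit on sofa", "lean on counter"]


def stub_insert(view: str, action: str) -> str:
    return f"{view}+{action}"


def stub_lift(edited: str, view: str) -> Optional[dict]:
    # Pretend the optimizer fails for one of the two actions.
    return None if "lean" in edited else {"source": edited}


samples = render_generate_lift("view_000", stub_propose, stub_insert, stub_lift)
print([s.action for s in samples])
```

Passing the stages in as callables keeps the sketch independent of any particular vision-language or image-editing model.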

If this is right

Augmenting existing training sets with InHabit samples raises accuracy on RGB-based 3D human-scene reconstruction.
Contact estimation between humans and scenes improves when models are trained with the generated data.
The new samples are rated higher than state-of-the-art synthetic data in 78 percent of direct perceptual comparisons.
The pipeline scales automatically to hundreds of large indoor environments without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline could be run on other large 3D scene collections to expand the range of environments covered.
Generated interaction data might serve as a starting point for training agents that predict human actions in novel spaces.
Refinements to the optimization step could further reduce any residual pose artifacts that current models leave behind.

Load-bearing premise

Off-the-shelf vision-language and image-editing models will produce contextually appropriate suggestions and placements that the subsequent optimization can turn into physically plausible 3D bodies without persistent artifacts or impossible configurations.

What would settle it

Generating a fresh batch of samples and counting the fraction of resulting SMPL-X bodies that exhibit interpenetrations with scene geometry or floating contacts after optimization.
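One way such a count could be set up is sketched below in Python. It assumes each body comes with per-vertex signed distances to the scene surface in meters (negative meaning inside geometry). The sign convention, the thresholds `pen_tol` and `float_tol`, and the function names are all assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def is_invalid(signed_dists: Sequence[float],
               pen_tol: float = 0.02,
               float_tol: float = 0.05) -> bool:
    """Flag a body from per-vertex signed distances to the scene (meters).

    Assumed convention: negative values lie inside scene geometry.
    A body is flagged if some vertex penetrates deeper than pen_tol, or if
    no vertex comes within float_tol of a surface (the body is floating).
    """
    deepest = min(signed_dists)
    penetrates = deepest < -pen_tol
    floating = deepest > float_tol
    return penetrates or floating


def invalid_fraction(batch: List[Sequence[float]]) -> float:
    """Fraction of bodies in the batch flagged as invalid."""
    if not batch:
        raise ValueError("empty batch")
    return sum(is_invalid(d) for d in batch) / len(batch)


batch = [
    [0.00, 0.10, 0.40],   # touches a surface: valid
    [-0.08, 0.05, 0.30],  # 8 cm inside geometry: penetration
    [0.12, 0.30, 0.60],   # nearest vertex 12 cm from any surface: floating
    [-0.01, 0.02, 0.20],  # slight penetration within tolerance: valid
]
print(invalid_fraction(batch))  # 0.5
```

A real check would also need contact semantics (which body parts should touch which objects), which this distance-only test does not capture.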

Figures

Figures reproduced from arXiv: 2604.19673 by Anna Khoreva, Gerard Pons-Moll, István Sárándi, Jiayi Wang, Nikita Kister, Pradyumna YM.

**Figure 1.** InHabit generates diverse, scene-aware 3D human placement across varied environments and actions. In this craft shop, it produces realistic behaviors such as browsing, leaning, crouching, and reaching. This enables large-scale generation of interaction data for embodied 3D scene understanding.

**Figure 2.** Scalable 2D-to-3D human–scene interaction generation.

**Figure 3.** Representative samples from InHabitants. Each pair shows the edited RGB image (left) and the corresponding lifted 3D human in the 3D scene (right). Our dataset contains diverse, context-aware interactions across varied indoor environments.

**Figure 4.** Distribution of InHabitants by action type (left) and by interacted object category (right), showing broad coverage of both diverse activities and scene elements.

**Figure 5.** Qualitative comparison. Our method generates diverse interactions, including those with rare objects (tiger, fireplace) and touch-less interactions (map). It also produces natural poses fitting for the context of the object (sofa, pool).

**Figure 6.** Placements from InHabit are widely preferred over others.

Original abstract

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics, ignoring rich scene context. In contrast, 2D foundation models trained at internet scale have acquired commonsense knowledge of human-environment interactions. To transfer this knowledge to 3D, we introduce InHabit, an automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces InHabitants, the first large-scale photorealistic 3D human-scene interaction dataset, with 78K samples across $\sim$800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and images. Augmenting standard training data with InHabitants improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over prior art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

InHabit gives a workable pipeline for scaling 3D human-scene data with foundation models, but the generated samples' physical plausibility rests on an untested assumption that the lift step works reliably at volume.

InHabit chains a vision-language model, an image editor, and a 3D optimizer to turn rendered scenes into SMPL-X annotated interactions. The result is a 78K-sample dataset over 800 Habitat-Matterport3D buildings. That pipeline and the dataset size are the actual new pieces; earlier synthetic sets used geometric rules that ignored scene context, while this one tries to borrow the commonsense already inside large 2D models.

Referee Report

3 major / 2 minor

Summary. The paper introduces InHabit, a fully automatic pipeline that applies a render-generate-lift strategy to 3D scenes: a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human figure, and an optimization step lifts the result into physically plausible SMPL-X bodies aligned with scene geometry. Applied to Habitat-Matterport3D, the method generates a dataset of 78K samples across 800 building-scale scenes containing complete 3D geometry, SMPL-X bodies, and RGB images. The authors claim that augmenting standard training data with these samples improves RGB-based 3D human-scene reconstruction and contact estimation, and that the generated data is preferred in 78% of cases over the state of the art in a perceptual user study.

Significance. If the pipeline reliably produces artifact-free, physically plausible samples at the claimed scale, the work would be significant for embodied AI and 3D vision: it offers a scalable route to context-rich human-scene interaction data that leverages commonsense knowledge implicit in internet-scale 2D foundation models, going beyond geometric heuristics or limited mocap. The render-generate-lift paradigm and the resulting dataset size represent a concrete advance in data generation for tasks requiring human-environment understanding.

major comments (3)

[Abstract and §3] Abstract and pipeline description (§3): The central claim of producing the first large-scale photorealistic 3D HSI dataset of 78K valid samples depends on the render-generate-lift pipeline consistently yielding artifact-free, physically plausible SMPL-X bodies without penetrations, floating, or implausible contacts. No quantitative metrics (e.g., optimization success rate, average penetration depth, contact accuracy, or failure-mode analysis) are reported to rule out that a non-negligible fraction of samples is invalid due to editing artifacts or under-constrained lifts.
[§5] §5 (downstream experiments): The stated improvements to RGB-based 3D human-scene reconstruction and contact estimation are presented without specific quantitative results, baseline comparisons, ablation tables, or error breakdowns showing the incremental gain from adding InHabit samples versus standard training data alone.
[User study section] User study (perceptual evaluation): The 78% preference rate is reported without details on study design, number of participants, question phrasing, number of comparisons per participant, or statistical significance testing, making it impossible to assess whether the result robustly supports the data-quality claim.
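To make the user-study comment concrete, the following Python sketch shows the kind of significance check it asks for: an exact one-sided binomial test of a preference rate against chance (50%). The counts used (78 of 100 comparisons) are hypothetical, chosen only to match the reported 78% rate; the paper's actual number of comparisons is not given in this review.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= successes), X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null)**(trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts: 78 of 100 pairwise comparisons prefer the new data.
p_value = binomial_tail(78, 100)
print(f"p = {p_value:.2e}")
```

This treats every comparison as independent, which may not hold when each participant rates many pairs; a per-participant analysis would then be more appropriate.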

minor comments (2)

[§3.3] Clarify the exact objective function, constraints, and convergence criteria used in the SMPL-X optimization step, including any regularization terms for contact and penetration.
[§4] Add a table or figure summarizing the distribution of proposed actions, scene categories, and body poses in the final 78K dataset to allow readers to judge diversity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that will strengthen the quantitative support for our claims.

Referee: [Abstract and §3] Abstract and pipeline description (§3): The central claim of producing the first large-scale photorealistic 3D HSI dataset of 78K valid samples depends on the render-generate-lift pipeline consistently yielding artifact-free, physically plausible SMPL-X bodies without penetrations, floating, or implausible contacts. No quantitative metrics (e.g., optimization success rate, average penetration depth, contact accuracy, or failure-mode analysis) are reported to rule out that a non-negligible fraction of samples is invalid due to editing artifacts or under-constrained lifts.

Authors: We agree that quantitative metrics are essential to validate the pipeline's reliability at scale. The current manuscript relies on qualitative examples and the downstream perceptual study to support the 78K valid samples claim. In the revised version we will add an analysis section reporting optimization success rate, average penetration depth (using standard SMPL-X penetration metrics), contact accuracy where ground truth is available, and a summary of failure modes and filtering steps applied to reach the final dataset size. revision: yes
Referee: [§5] §5 (downstream experiments): The stated improvements to RGB-based 3D human-scene reconstruction and contact estimation are presented without specific quantitative results, baseline comparisons, ablation tables, or error breakdowns showing the incremental gain from adding InHabit samples versus standard training data alone.

Authors: The manuscript states that augmenting standard training data with InHabit samples improves reconstruction and contact estimation, but we acknowledge the lack of detailed quantitative breakdowns. We will revise §5 to include explicit numerical results (e.g., MPJPE, contact F1, or equivalent metrics), direct comparisons against baselines trained without InHabit data, ablation tables isolating the contribution of the new samples, and per-scene or per-category error analyses. revision: yes
Referee: [User study section] User study (perceptual evaluation): The 78% preference rate is reported without details on study design, number of participants, question phrasing, number of comparisons per participant, or statistical significance testing, making it impossible to assess whether the result robustly supports the data-quality claim.

Authors: We agree that the user-study description is insufficient for reproducibility and credibility assessment. In the revision we will expand the user-study section with the exact protocol: number of participants, recruitment method, precise question wording and interface, number of pairwise comparisons shown per participant, randomization procedure, and any statistical significance tests (e.g., binomial test or p-values) supporting the 78% preference. revision: yes
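For readers unfamiliar with the metrics named in the responses above, here is a minimal Python sketch of MPJPE (mean per-joint position error) and a contact F1 score over binary per-vertex labels. These are the standard textbook definitions; the toy inputs are made up, and whether the paper uses these exact variants (for example, with or without root alignment) is not stated in this review.

```python
from math import dist
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def contact_f1(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """F1 score over binary contact labels (True = in contact)."""
    tp = sum(p and g for p, g in zip(pred, gt))
    fp = sum(p and not g for p, g in zip(pred, gt))
    fn = sum(g and not p for p, g in zip(pred, gt))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy inputs (meters for joints), for illustration only.
pred_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_joints = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
print(mpjpe(pred_joints, gt_joints))

pred_contact = [True, True, False, False]
gt_contact = [True, False, True, False]
print(contact_f1(pred_contact, gt_contact))
```

An ablation of the kind promised would report these numbers for a model trained with and without the generated samples on the same test split.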

Circularity Check

0 steps flagged

No circularity: pipeline uses external models to generate independent dataset

The derivation chain consists of applying off-the-shelf VLMs and image-editing models (external to the paper) followed by an optimization lift to SMPL-X, then empirically evaluating the resulting 78K-sample dataset via augmentation experiments and a separate perceptual study. No equations, parameters, or claims reduce by construction to fitted inputs, self-definitions, or author self-citations; the central results are new data plus downstream measurements that are falsifiable outside the generation process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the effectiveness of pre-trained 2D foundation models for commonsense interaction knowledge and the reliability of the lifting optimization; no explicit free parameters are mentioned.

axioms (1)

domain assumption: 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions.
Invoked in the abstract as the core premise for transferring knowledge to 3D via the proposed pipeline.

invented entities (1)

InHabit render-generate-lift pipeline (no independent evidence)
purpose: To automatically generate large-scale 3D human-scene interaction data from existing 3D scenes.
New method introduced to bridge 2D models and 3D body placement.

pith-pipeline@v0.9.0 · 5569 in / 1422 out tokens · 55942 ms · 2026-05-10T03:00:56.463559+00:00 · methodology
