Decoupled Generative Modeling for Human-Object Interaction Synthesis

Giljoo Nam; Hwanhee Jung; Jeongyoon Yoon; Qixing Huang; Sangpil Kim; Seunggwan Lee; SeungHyeon Kim

arxiv: 2512.19049 · v2 · submitted 2025-12-22 · 💻 cs.CV

Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim

Pith reviewed 2026-05-16 20:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interaction · 3D motion synthesis · generative modeling · trajectory planning · action synthesis · decoupled framework · contact realism · adversarial training

The pith

Separating trajectory planning from detailed action synthesis produces more realistic and synchronized human-object interactions without manual waypoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a decoupled generative framework called DecHOI for creating 3D human-object interaction sequences. It first uses a trajectory generator to plan human and object paths independently, then feeds those paths into a separate action generator that produces the fine-grained motions. This split aims to avoid the synchronization failures and object penetrations that arise when a single network must handle both planning and execution. An adversarial discriminator focused on distal joint dynamics further sharpens contact quality. On the FullBodyManipulation and 3D-FUTURE benchmarks the method records better scores than prior unified approaches on most metrics, with human raters also preferring the outputs.

Core claim

DecHOI separates the synthesis task into a trajectory generator that produces consistent human and object paths without prescribed waypoints and an action generator that conditions on those paths to synthesize detailed motions. Adversarial training with a distal-joint discriminator improves contact realism. The framework also models moving objects and supports responsive long-sequence planning in dynamic scenes while preserving plan consistency.

What carries the argument

The two-stage decoupled generator consisting of a trajectory planner that outputs paths and an action synthesizer conditioned on those paths, augmented by a distal-joint adversarial discriminator.
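The data flow between the two stages can be illustrated with a toy sketch. It is not the authors' model: the learned generators are replaced by hand-written stand-ins, and every name here (`Trajectories`, `plan_trajectories`, `synthesize_actions`, `hand_offset`) is hypothetical. The only point it shows is that stage 2 receives the planned paths as its conditioning input and never sees waypoints.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Trajectories:
    human: List[Vec3]  # per-frame root position of the human
    obj: List[Vec3]    # per-frame position of the object


def lerp(a: Vec3, b: Vec3, t: float) -> Vec3:
    """Linear interpolation between two 3D points."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def plan_trajectories(human_start: Vec3, obj_start: Vec3,
                      obj_goal: Vec3, n_frames: int) -> Trajectories:
    """Stage 1 stand-in (assumption): walk to the object, then carry it to the goal."""
    half = n_frames // 2
    human, obj = [], []
    for f in range(n_frames):
        if f < half:
            t = f / max(half - 1, 1)
            human.append(lerp(human_start, obj_start, t))
            obj.append(obj_start)
        else:
            t = (f - half) / max(n_frames - half - 1, 1)
            p = lerp(obj_start, obj_goal, t)
            human.append(p)
            obj.append(p)
    return Trajectories(human, obj)


def synthesize_actions(traj: Trajectories,
                       hand_offset: Vec3 = (0.0, 0.0, 1.0)) -> List[Dict[str, Vec3]]:
    """Stage 2 stand-in (assumption): emit one toy 'pose' per frame, conditioned on the paths."""
    frames = []
    for root, obj in zip(traj.human, traj.obj):
        hand = tuple(r + o for r, o in zip(root, hand_offset))
        frames.append({"root": root, "hand": hand, "object": obj})
    return frames


plan = plan_trajectories((0, 0, 0), (2, 0, 0), (2, 3, 0), n_frames=8)
motion = synthesize_actions(plan)
```

Because the stages only communicate through `Trajectories`, either one can be swapped or retrained without touching the other, which is the flexibility argument the paper makes for decoupling.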

If this is right

The method eliminates the need for manually specified intermediate waypoints in HOI synthesis.
It enables responsive planning for moving objects and extended sequences while keeping trajectories consistent.
Targeted adversarial training on distal joints yields measurably more realistic contacts than standard objectives.
Quantitative gains appear across FullBodyManipulation and 3D-FUTURE benchmarks on most reported metrics.
Perceptual studies show human viewers favor the decoupled outputs over prior unified approaches.
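As a rough illustration of what "focused on distal joint dynamics" could mean as a discriminator input, the sketch below extracts finite-difference velocities for hands and feet only. The joint names, the frame layout, and the `fps` parameter are assumptions made for this example; the abstract does not specify the actual feature set or skeleton.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = Dict[str, Vec3]

# Assumed joint names; a real skeleton would use its own indexing.
DISTAL_JOINTS = ("left_hand", "right_hand", "left_foot", "right_foot")


def distal_velocities(frames: Sequence[Frame], fps: float = 30.0,
                      joints: Sequence[str] = DISTAL_JOINTS) -> List[List[float]]:
    """Return one flat velocity vector per frame transition, distal joints only."""
    feats = []
    for prev, cur in zip(frames, frames[1:]):
        row: List[float] = []
        for j in joints:
            # Finite difference: displacement per frame times frames per second.
            row.extend((c - p) * fps for p, c in zip(prev[j], cur[j]))
        feats.append(row)
    return feats


rest = {j: (0.0, 0.0, 0.0) for j in DISTAL_JOINTS}
moved = dict(rest, left_hand=(0.5, 0.0, 0.0))
feats = distal_velocities([rest, moved], fps=2.0)
```

Restricting the input to these joints keeps the adversarial signal concentrated where contact happens, rather than spreading it across the whole body.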

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of planning and execution stages could be applied to other multi-agent or multi-object generative tasks in animation and robotics.
Plan consistency preservation may allow real-time replanning in interactive settings without recomputing entire sequences from scratch.
The distal-joint discriminator idea could be adapted to improve contact quality in related domains such as hand-object or cloth-body interactions.
If the decoupling pattern holds, future models might reduce hyperparameter search effort by handling high-level constraints in the trajectory stage alone.

Load-bearing premise

Separating path planning from action synthesis inherently reduces unsynchronized motion and penetration without requiring additional manual tuning or introducing new failure modes in dynamic scenes.

What would settle it

A controlled experiment on long dynamic sequences in which a single joint-optimization baseline matches or exceeds DecHOI on both motion synchronization error and object penetration rate would falsify the claimed benefit of decoupling.
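To make such a comparison concrete, both quantities need an operational definition. The sketch below gives one possible pair, purely as an assumption: penetration is tested against a sphere proxy for the object, and synchronization error is measured as frame-to-frame drift in the hand-object distance. The paper's benchmarks may define these metrics differently.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def penetration_rate(joint_frames: Sequence[Sequence[Vec3]],
                     obj_centers: Sequence[Vec3], obj_radius: float) -> float:
    """Fraction of frames where at least one joint lies inside the object's sphere proxy."""
    hits = 0
    for joints, center in zip(joint_frames, obj_centers):
        if any(math.dist(j, center) < obj_radius for j in joints):
            hits += 1
    return hits / len(obj_centers)


def sync_error(hand_path: Sequence[Vec3], obj_path: Sequence[Vec3]) -> float:
    """Mean absolute change in hand-object distance between consecutive frames.

    Zero when hand and object move rigidly together (hypothetical definition).
    """
    gaps = [math.dist(h, o) for h, o in zip(hand_path, obj_path)]
    drift = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(drift) / len(drift)


# One joint per frame; the object sits at x = 1 with radius 0.5.
rate = penetration_rate([[(0, 0, 0)], [(1, 0, 0)]],
                        [(1, 0, 0), (1, 0, 0)], obj_radius=0.5)

# Hand and object translate together with a constant 1-unit vertical gap.
err = sync_error([(0, 0, 0), (1, 0, 0), (2, 0, 0)],
                 [(0, 0, 1), (1, 0, 1), (2, 0, 1)])
```

Under these definitions, the falsification test amounts to computing both numbers for DecHOI and for the joint-optimization baseline on the same long dynamic sequences.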

Figures

Figures reproduced from arXiv: 2512.19049 by Giljoo Nam, Hwanhee Jung, Jeongyoon Yoon, Qixing Huang, Sangpil Kim, Seunggwan Lee, SeungHyeon Kim.

**Figure 2.** Architecture of DecHOI showing the decoupled trajectory and action generation process. Conditioned on the text instruction, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Adversarial module of DecHOI, where a hand and foot [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Qualitative comparison of DecHOI with CHOIS [ [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 6.** Visualization of training loss landscapes for DecHOI and [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 7.** Visualization of DecHOI in long-sequence dynamic environments. The human agent (blue) adaptively re-plans its path when [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

**Figure 8.** Stacked horizontal bars showing user preference [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]

**Figure 1.** Visualization of trade-off relationships induced by the [PITH_FULL_IMAGE:figures/full_fig_p013_1.png]

**Figure 2.** The green dot and circle denote the agent and [PITH_FULL_IMAGE:figures/full_fig_p014_2.png]

**Figure 3.** Example 2AFC interface in which participants read a [PITH_FULL_IMAGE:figures/full_fig_p017_3.png]

**Figure 4.** Additional qualitative comparison of DecHOI with CHOIS [ [PITH_FULL_IMAGE:figures/full_fig_p018_4.png]

**Figure 5.** Additional qualitative comparison of DecHOI and CHOIS [ [PITH_FULL_IMAGE:figures/full_fig_p019_5.png]

**Figure 6.** Visualization of DecHOI in long-sequence dynamic environments. The human agent (blue) adaptively re-plans its path when [PITH_FULL_IMAGE:figures/full_fig_p019_6.png]

read the original abstract

Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main move is splitting HOI synthesis into a waypoint-free trajectory stage followed by a conditioned action stage plus a distal-joint discriminator, which looks like a reasonable way to ease some common artifacts.

The key takeaway is that this paper splits HOI motion synthesis into a trajectory planner that works without manual waypoints and an action generator that fills in the details, with a discriminator tuned to distal-joint dynamics. That separation looks like a practical way to cut down on motion synchronization issues and penetrations.

What stands out is the two-stage design. Most prior work piles everything into one network, which can get messy with competing objectives. Here the authors first generate paths for the human and the object without prescribed waypoints, then condition the detailed motion on those paths. They also model moving counterparts and support responsive long-sequence planning while keeping plan consistency. The distal-joint focus in the adversarial training is aimed specifically at contact realism, a common failure mode in these syntheses.

On the positive side, the abstract reports that DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations across the FullBodyManipulation and 3D-FUTURE benchmarks, with perceptual studies preferring the results. If the full paper shows solid ablations and the improvements are consistent, this decoupled approach could streamline pipelines in animation and embodied AI where manual intervention is a bottleneck.

The soft spots center on how well the decoupling holds up in practice. The central assumption is that separating path planning from action synthesis reduces unsynchronized motions and penetrations without new failure modes, but the abstract doesn't detail tests for error propagation between the modules or performance in highly dynamic scenes. Without quantitative details, error bars, or ablation studies visible in the summary, it's difficult to assess how much of the gain comes from the split versus other implementation choices. Whether the action generator is conditioned tightly enough on the planned paths to avoid reintroducing errors remains an open question until the full results are checked.

This work is for researchers in computer vision and robotics focused on human-object interaction synthesis. Readers looking for architectural ideas to improve generative models for realistic motions would find it useful. It deserves a serious referee because the problem is well-motivated and the proposed split is a substantive change from single-network methods, even if the empirical support needs closer examination during review. I recommend sending it out for peer review to get detailed feedback on the experiments and robustness.

Referee Report

2 major / 1 minor

Summary. The paper introduces DecHOI, a decoupled generative framework for human-object interaction synthesis that separates trajectory generation (producing human and object paths without prescribed waypoints) from action synthesis (conditioning detailed motions on those paths, augmented by distal-joint adversarial training). It claims this reduces unsynchronized motions and penetrations, supports responsive long-sequence planning with moving counterparts, and outperforms prior methods on the FullBodyManipulation and 3D-FUTURE benchmarks in quantitative metrics, qualitative evaluations, and perceptual studies.

Significance. If the empirical gains and robustness to error propagation are confirmed, the modular decoupling could meaningfully improve flexibility and reduce manual tuning in HOI synthesis pipelines for animation and robotics, moving beyond single-network optimization that often trades off contact realism against motion coherence.

major comments (2)

[Abstract] Abstract: superiority is asserted on 'most quantitative metrics' across two benchmarks without reporting specific values, standard deviations, or ablation controls, leaving the central claim that decoupling reduces unsynchronized motion and penetration unverifiable from the provided text.
[Method] Method (trajectory-to-action interface): the claim that conditioning the action generator on planned paths automatically prevents upstream trajectory errors from reintroducing unsynchronized motions or penetrations in dynamic scenes is load-bearing yet unsupported by any described stress test, loss-balancing analysis, or failure-mode evaluation when the counterpart is moving.

minor comments (1)

Ensure all quantitative tables include error bars or confidence intervals and clearly label which metrics are 'most' improved versus those where prior methods remain competitive.
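One common way to produce such error bars is a mean with a normal-approximation confidence interval over repeated runs. The sketch below assumes per-seed metric values are available; the numbers in the usage lines are made up for illustration and are not results from the paper. With only a handful of seeds, a t-distribution quantile would be more appropriate than the fixed `z = 1.96` used here.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(values: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    m = statistics.mean(values)
    if len(values) < 2:
        # A single run gives no spread estimate.
        return m, float("nan")
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m, z * sem


# Hypothetical per-seed scores, not taken from the paper.
mean, half_width = mean_with_ci([1.0, 2.0, 3.0])
report = f"{mean:.2f} ± {half_width:.2f}"
```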

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of DecHOI. We agree that the abstract requires more concrete reporting and that the robustness claims around the trajectory-to-action interface merit additional validation. We will revise the manuscript accordingly, updating the abstract with specific metrics and adding targeted experiments and analysis for the method section. Point-by-point responses follow.

Referee: [Abstract] Abstract: superiority is asserted on 'most quantitative metrics' across two benchmarks without reporting specific values, standard deviations, or ablation controls, leaving the central claim that decoupling reduces unsynchronized motion and penetration unverifiable from the provided text.

Authors: We agree that the abstract should be more specific to make the central claims verifiable. In the revised version we will replace the phrase 'surpasses prior methods on most quantitative metrics' with concrete numbers drawn from Tables 1 and 2 (e.g., contact error reduction of X% ± std on FullBodyManipulation and penetration reduction of Y% ± std on 3D-FUTURE), explicitly note the ablation results that isolate the contribution of decoupling, and mention the perceptual study preference rates. These values and standard deviations are already computed and reported in the experimental section; only the abstract summary will be updated.

revision: yes

Referee: [Method] Method (trajectory-to-action interface): the claim that conditioning the action generator on planned paths automatically prevents upstream trajectory errors from reintroducing unsynchronized motions or penetrations in dynamic scenes is load-bearing yet unsupported by any described stress test, loss-balancing analysis, or failure-mode evaluation when the counterpart is moving.

Authors: The design intentionally decouples the modules so that the action generator receives the full planned trajectories as conditioning input and is trained to produce motions that follow those trajectories exactly; this architectural constraint, together with the distal-joint adversarial loss, is intended to limit error propagation. We acknowledge, however, that the current manuscript does not include explicit stress tests that inject controlled trajectory noise or evaluate long-horizon dynamic scenes with moving objects. In the revision we will add (i) a new subsection with perturbation experiments that measure synchronization and penetration metrics under increasing trajectory error, (ii) a short loss-balancing analysis, and (iii) qualitative failure-case examples for moving-counterpart scenarios. These additions will be placed in the supplementary material with a brief reference in the main text.

revision: yes
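The perturbation experiment described in (i) could be organized along the following lines. This is a hedged harness sketch, not the authors' protocol: `perturb`, `stress_test`, and `mean_gap` are hypothetical names, the Gaussian noise model is an assumption, and the identity function stands in for a trained action generator that follows its conditioning path exactly.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Path = List[Vec3]


def perturb(path: Sequence[Vec3], sigma: float, rng: random.Random) -> Path:
    """Add independent Gaussian noise of scale sigma to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in path]


def stress_test(obj_path: Sequence[Vec3],
                follow: Callable[[Path], Path],
                metric: Callable[[Path, Sequence[Vec3]], float],
                sigmas: Sequence[float], seed: int = 0) -> List[Tuple[float, float]]:
    """Score the generated motion against the clean path at each noise level."""
    rng = random.Random(seed)
    results = []
    for sigma in sigmas:
        noisy = perturb(obj_path, sigma, rng)   # corrupted stage-1 output
        motion = follow(noisy)                  # stage 2 conditioned on it
        results.append((sigma, metric(motion, obj_path)))
    return results


def mean_gap(hand_path: Path, obj_path: Sequence[Vec3]) -> float:
    """Average hand-to-object distance, used here as a toy error metric."""
    return sum(math.dist(h, o) for h, o in zip(hand_path, obj_path)) / len(obj_path)


clean = [(float(i), 0.0, 0.0) for i in range(5)]
# Identity 'follow' is a stand-in for a generator that tracks its input exactly.
results = stress_test(clean, follow=lambda p: p, metric=mean_gap,
                      sigmas=[0.0, 0.05, 0.2])
```

Plotting the metric against `sigma` would show how quickly upstream trajectory error turns into downstream contact error, which is the propagation question the referee raises.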

Circularity Check

0 steps flagged

No significant circularity; decoupling is an independent architectural choice

full rationale

The provided abstract and description frame DecHOI as a separation of trajectory generation (no waypoints) from conditioned action synthesis plus distal-joint adversarial training. No equations, fitted-parameter renamings, or self-citation chains are exhibited that would reduce any claimed prediction or result to its inputs by construction. Benchmark comparisons on FullBodyManipulation and 3D-FUTURE are external evaluations rather than internal re-expressions. The derivation chain therefore remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that separate trajectory and action models can be trained to produce consistent plans without additional constraints.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions... adversarial training with a discriminator that focuses on the dynamics of distal joints.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uni-HOI:A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction
cs.CV 2026-04 unverdicted novelty 5.0

Uni-HOI learns the joint distribution of text, human motion, and object motion using LLMs and VQ-VAEs in a two-stage training process for multiple HOI tasks.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129(12):3313–3337, 2021.

[2] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5152–5161, 2022.

[3] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[4] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[5] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[6] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.

[7] Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6):1–11, 2023.

[8] Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In European Conference on Computer Vision, pages 54–72. Springer, 2024.

[9] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017.

[10] Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, and Christian Claudel. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14424–14432, 2020.

[11] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.

[12] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.

[13] Zhen Wu, Jiaman Li, Pei Xu, and C Karen Liu. Human-object interaction from human-level instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11176–11186, 2025.
