pith · machine review for the scientific record

arxiv: 2605.04554 · v2 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links


InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multi-person human mesh recovery · human-object interaction · pose estimation · shape estimation · DETR framework · interaction-aware modeling · 3D reconstruction

The pith

InterMesh improves multi-person human mesh recovery by explicitly adding structured interaction semantics from a detector into query features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard DETR-style methods for recovering 3D meshes of multiple people only capture interactions implicitly through self-attention. InterMesh instead feeds explicit human-object and human-human interaction signals from a dedicated detector into two lightweight modules that update the person queries. These modules enrich the features used for pose and shape prediction without changing the core architecture much. Experiments on five datasets confirm lower error rates, especially where people touch or use objects. A reader would care because many real scenes involve such interactions, and better meshes support downstream uses like robotics and augmented reality.

Core claim

InterMesh explicitly incorporates human-environment interaction information into the human mesh recovery pipeline by leveraging a human-object interaction detector to enrich query representations with structured interaction semantics. Two lightweight modules, the Contextual Interaction Encoder and the Interaction-Guided Refiner, integrate these features into existing HMR architectures with minimal overhead for more accurate pose and shape estimation.

What carries the argument

Contextual Interaction Encoder and Interaction-Guided Refiner modules that integrate structured interaction semantics from a human-object interaction detector into the person query representations.
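To make the data flow concrete, here is a minimal sketch of how such a fusion step could look in PyTorch. The two module names come from the paper; every internal detail (tensor shapes, attention layout, the `hoi_feats` input) is an assumption made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the interaction-aware query update.
# Module names follow the paper; all internals are assumptions.
import torch
import torch.nn as nn

class ContextualInteractionEncoder(nn.Module):
    """Encodes HOI detector outputs (e.g. embeddings of <human, verb, object>
    triplets) into a set of interaction tokens."""
    def __init__(self, hoi_dim=256, d_model=256, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(hoi_dim, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hoi_feats):                        # (B, num_interactions, hoi_dim)
        tokens = self.proj(hoi_feats)
        ctx, _ = self.self_attn(tokens, tokens, tokens)  # tokens exchange context
        return self.norm(tokens + ctx)                   # (B, num_interactions, d_model)

class InteractionGuidedRefiner(nn.Module):
    """Refines DETR person queries by cross-attending to interaction tokens."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, person_queries, interaction_tokens):
        upd, _ = self.cross_attn(person_queries, interaction_tokens, interaction_tokens)
        return self.norm(person_queries + upd)           # residual: enrich, don't replace

# Usage inside one decoder layer; refined queries then feed the pose/shape heads.
B, N, M, D = 2, 10, 5, 256                  # batch, people, interactions, width
queries = torch.randn(B, N, D)              # person queries after self-attention
hoi_feats = torch.randn(B, M, D)            # detector output embeddings (assumed shape)
refined = InteractionGuidedRefiner(D)(queries, ContextualInteractionEncoder(D)(hoi_feats))
```

The residual update is the load-bearing design choice in this sketch: interaction context enriches the existing queries without replacing them or disturbing the rest of the decoder, which would be consistent with the paper's claim of minimal overhead.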

If this is right

  • Multi-person mesh estimates become more accurate in scenes with object contacts and group interactions, as measured by MPJPE reductions of 9.9 percent on CMU Panoptic and 8.2 percent on Hi4D (the metric and the percent-reduction arithmetic are spelled out in the sketch after this list).
  • Existing DETR-based human mesh recovery pipelines can incorporate explicit interaction cues via small added modules without full retraining.
  • Performance gains appear across varied datasets including 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D when interaction complexity is high.
  • Pose and shape outputs benefit from relational semantics beyond what implicit attention alone provides.
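MPJPE, the metric behind those percentages, is the mean Euclidean distance between predicted and ground-truth 3D joints, conventionally reported in millimetres after root-joint alignment. A minimal sketch with illustrative numbers only:

```python
# Sketch of the MPJPE metric and the percent-reduction arithmetic quoted above.
# The joint arrays and the 50.0 mm baseline are illustrative, not paper values.
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint position error in mm after root-joint alignment.
    pred, gt: (num_people, num_joints, 3) arrays of 3D joint positions in mm."""
    pred = pred - pred[:, root_idx:root_idx + 1]   # center each person at the root joint
    gt = gt - gt[:, root_idx:root_idx + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def percent_reduction(baseline, method):
    return 100.0 * (baseline - method) / baseline

# A baseline at 50.0 mm improved to 45.05 mm is a 9.9 percent reduction,
# the size of the gain reported on CMU Panoptic:
print(percent_reduction(50.0, 45.05))   # ~9.9
```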

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If interaction detectors continue to improve independently, mesh recovery accuracy would increase without retraining the mesh model itself.
  • The same enrichment pattern could be tested on single-person recovery or on tasks like action recognition that already use human meshes.
  • Real-time deployment would require the interaction detector to match the mesh model's speed so that total latency stays acceptable.
  • Similar explicit relational features might help other vision problems that currently rely only on self-attention for scene context.

Load-bearing premise

The human-object interaction detector supplies accurate, relevant interaction signals that can be fused directly to improve mesh estimates without introducing new errors.

What would settle it

Running the same pipeline on Hi4D or CMU Panoptic with the interaction detector outputs replaced by random or zeroed features, and observing no increase in MPJPE (or even a decrease) relative to the full model, would falsify the claim that the explicit semantics drive the accuracy gains.
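A hedged sketch of that test in code; `model`, the data loader, and the `hoi_feats` keyword are hypothetical stand-ins for whatever interface the released code exposes:

```python
# Ablation sketch: swap real HOI detector features for zeroed or random
# tensors and compare average MPJPE. All interfaces here are assumed.
import torch

def eval_mpjpe(model, loader, hoi_source):
    """Average per-batch MPJPE with a pluggable HOI feature source."""
    total, batches = 0.0, 0
    model.eval()
    with torch.no_grad():
        for images, gt_joints in loader:
            pred_joints = model(images, hoi_feats=hoi_source(images))
            total += torch.linalg.norm(pred_joints - gt_joints, dim=-1).mean().item()
            batches += 1
    return total / batches

def run_ablation(model, loader, hoi_detector):
    real = eval_mpjpe(model, loader, hoi_detector)
    zeroed = eval_mpjpe(model, loader, lambda im: torch.zeros_like(hoi_detector(im)))
    rand = eval_mpjpe(model, loader, lambda im: torch.randn_like(hoi_detector(im)))
    # If zeroed or random inputs leave MPJPE unchanged, the explicit
    # interaction semantics are not what drives the reported gains.
    return {"real": real, "zeroed": zeroed, "random": rand}
```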

Figures

Figures reproduced from arXiv: 2605.04554 by Chenyi Guo, Ji Wu, Kaili Zheng, Kaiwen Wang, Xun Zhu.

Figure 1: Overview of InterMesh. Unlike prior methods that rely solely on implicit modeling of inter-human relationships, …
Figure 2: Architecture of a single decoder layer of InterMesh. Each layer takes as input human queries updated by self-attention …
Figure 3: Qualitative comparison on 3DPW. (a) 3DPW (b) Hi4D
Figure 3: Illustration of the attention masks used for batched training …
Figure 4: Per-vertex error comparison on 3DPW and Hi4D.
Figure 4: Qualitative comparison on 3DPW; panel labels Image, SAT-HMR, InterMesh for the interactions (person, carry, person), (person, hold, person), and (person, hug, person).
Figure 5: Ablation study on the starting decoder layer for …
Figure 6: Example of attention masks for (a) MHSA in Con…
Figure 6: Visualization results on CMU Panoptic.
Figure 7: Qualitative comparison on Hi4D.
Figure 7: Visualization results on MuPoTS.
Figure 8: More visualization results on 3DPW.
Figure 8: Visualization results on in-the-wild samples.
Figure 9: Visualization results on CMU Panoptic.
Figure 9: Per-vertex error comparison on 3DPW and Hi4D. Green regions …
Figure 10: Visualization results on MuPoTS.
Figure 10: Ablation study on the starting decoder layer for incorporating human …
Figure 11: Visualization results on in-the-wild samples.
original abstract

Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces InterMesh, an end-to-end multi-person human mesh recovery framework that augments DETR-style query representations with explicit human-object interaction semantics extracted from an off-the-shelf HOI detector. It proposes lightweight Contextual Interaction Encoder and Interaction-Guided Refiner modules to fuse these features into existing HMR pipelines and reports consistent gains over prior methods on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D, including 9.9% MPJPE reduction on CMU Panoptic and 8.2% on Hi4D, while releasing code.

Significance. If the gains prove robust, the work would usefully extend implicit self-attention baselines by injecting structured interaction priors, with particular value in crowded or object-rich scenes. The public release of code and models is a clear strength that supports reproducibility.

major comments (1)
  1. [Experiments] Experiments section: the central claim that the HOI detector supplies accurate, noise-free semantics that the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, nor is error propagation quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'remarkable improvements' is subjective; replace with precise percentage gains and baseline names for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the experimental validation of the HOI detector's role. We agree that additional analysis is warranted and will incorporate the suggested ablation in the revised manuscript.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that the HOI detector supplies accurate, noise-free semantics that the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, nor is error propagation quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.

    Authors: We agree that testing robustness to HOI detector quality strengthens the claims. In the revision we will add an ablation that replaces the off-the-shelf HOI detector outputs with controlled noisy versions (random labels, reduced-accuracy oracles, and simulated occlusion-induced errors) and report the resulting MPJPE changes on CMU Panoptic and Hi4D. This will quantify error propagation through the Contextual Interaction Encoder and Interaction-Guided Refiner. revision: yes
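One possible shape for those controlled corruptions, sketched below. The corruption modes, the 30 percent default rate, and the 117-verb default (the HICO-DET verb count) are assumptions, not the authors' stated protocol:

```python
# Hypothetical corruption of HOI detections for the promised ablation:
# triplets are (human_box, obj_box, verb_id); modes and rates are assumed.
import random

def corrupt_hoi_detections(detections, mode, rate=0.3, num_verbs=117, seed=0):
    """Return a corrupted copy of a list of HOI triplets.

    mode = "random_labels": re-draw a fraction of verb labels uniformly.
    mode = "dropped": delete a fraction of detections (occlusion-style misses).
    """
    rng = random.Random(seed)
    corrupted = []
    for human_box, obj_box, verb_id in detections:
        if mode == "dropped" and rng.random() < rate:
            continue                                # simulate a missed detection
        if mode == "random_labels" and rng.random() < rate:
            verb_id = rng.randrange(num_verbs)      # uniform over the verb set
        corrupted.append((human_box, obj_box, verb_id))
    return corrupted
```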

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external detector inputs

full rationale

The paper describes an end-to-end architecture that feeds outputs from a pre-existing human-object interaction detector into Contextual Interaction Encoder and Interaction-Guided Refiner modules within a DETR-style HMR backbone. No equations, fitted parameters, or self-citation chains are presented that reduce the claimed MPJPE improvements to quantities defined inside the same model. Validation occurs on independent external benchmarks (3DPW, MuPoTS, CMU Panoptic, Hi4D, CHI3D) with reported gains over prior SOTA, satisfying the criteria for a self-contained empirical contribution without load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a pre-trained human-object interaction detector supplies useful structured features and that standard end-to-end training on the listed datasets will integrate those features effectively.

free parameters (1)
  • interaction integration weights
    Learned parameters inside the Contextual Interaction Encoder and Interaction-Guided Refiner that control how detector outputs are fused.
axioms (1)
  • domain assumption: The human-object interaction detector produces reliable and task-relevant semantics
    The pipeline depends on this external detector to enrich queries; its accuracy is taken as given.

pith-pipeline@v0.9.0 · 5531 in / 1372 out tokens · 82537 ms · 2026-05-15T06:26:10.329284+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 4 internal anchors

  1. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "Smpl: A skinned multi-person linear model," ACM Transactions on Graphics, vol. 34, no. 6, 2015.
  2. C.-Y. Weng, B. Curless, and I. Kemelmacher-Shlizerman, "Photo wake-up: 3d character animation from a single photo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5908–5917.
  3. J. Wang, Y. Yuan, Z. Luo, K. Xie, D. Lin, U. Iqbal, S. Fidler, and S. Khamis, "Learning human dynamics in autonomous driving scenarios," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20796–20806.
  4. J. Liu, N. Saquib, C. Zhutian, R. H. Kazi, L.-Y. Wei, H. Fu, and C.-L. Tai, "Posecoach: A customizable analysis and visualization system for video-based running coaching," IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 7, pp. 3180–3195, 2022.
  5. Y. Xiu, J. Yang, D. Tzionas, and M. J. Black, "Icon: Implicit clothed humans obtained from normals," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 13286–13296.
  6. N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, "Learning to reconstruct 3d human pose and shape via model-fitting in the loop," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2252–2261.
  7. M. Kocabas, N. Athanasiou, and M. J. Black, "Vibe: Video inference for human body pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5253–5263.
  8. M. Kocabas, C.-H. P. Huang, O. Hilliges, and M. J. Black, "Pare: Part attention regressor for 3d human body estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11127–11137.
  9. Y. Xue, J. Chen, Y. Zhang, C. Yu, H. Ma, and H. Ma, "3d human mesh reconstruction by learning to sample joint adaptive tokens for transformers," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6765–6773.
  10. Z. Li, J. Liu, Z. Zhang, S. Xu, and Y. Yan, "Cliff: Carrying location information in full frames into human pose and shape estimation," in European Conference on Computer Vision. Springer, 2022, pp. 590–606.
  11. S. Goel, G. Pavlakos, J. Rajasegaran, A. Kanazawa, and J. Malik, "Humans in 4d: Reconstructing and tracking humans with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14783–14794.
  12. Z. Cai, W. Yin, A. Zeng, C. Wei, Q. Sun, W. Yanjun, H. E. Pang, H. Mei, M. Zhang, L. Zhang et al., "Smpler-x: Scaling up expressive human pose and shape estimation," Advances in Neural Information Processing Systems, vol. 36, pp. 11454–11468, 2023.
  13. J. Zhang, D. Yu, J. H. Liew, X. Nie, and J. Feng, "Body meshes as points," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 546–556.
  14. Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, "Monocular, one-stage, regression of multiple 3d people," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11179–11188.
  15. Y. Sun, W. Liu, Q. Bao, Y. Fu, T. Mei, and M. J. Black, "Putting people in their place: Monocular regression of 3d people in depth," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13243–13252.
  16. Z. Qiu, Q. Yang, J. Wang, H. Feng, J. Han, E. Ding, C. Xu, D. Fu, and J. Wang, "Psvt: End-to-end multi-person 3d pose and shape estimation with progressive video transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21254–21263.
  17. Q. Sun, Y. Wang, A. Zeng, W. Yin, C. Wei, W. Wang, H. Mei, C.-S. Leung, Z. Liu, L. Yang et al., "Aios: All-in-one-stage expressive human pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1834–1843.
  18. F. Baradel, M. Armando, S. Galaaoui, R. Brégier, P. Weinzaepfel, G. Rogez, and T. Lucas, "Multi-hmr: Multi-person whole-body human mesh recovery in a single shot," in European Conference on Computer Vision. Springer, 2024, pp. 202–218.
  19. C. Su, X. Ma, J. Su, and Y. Wang, "Sat-hmr: Real-time multi-person 3d mesh estimation via scale-adaptive tokens," arXiv preprint arXiv:2411.19824, 2024.
  20. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
  21. M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Reconstructing three-dimensional models of interacting humans," arXiv preprint arXiv:2308.01854, 2023.
  22. B. Huang, C. Li, C. Xu, L. Pan, Y. Wang, and G. H. Lee, "Closely interactive human reconstruction with proxemics and physics-guided adaption," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1011–1021.
  23. M. Hassan, P. Ghosh, J. Tesch, D. Tzionas, and M. J. Black, "Populating 3d scenes by learning human-scene interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14708–14718.
  24. X. Xie, B. L. Bhatnagar, and G. Pons-Moll, "Chore: Contact, human and object reconstruction from a single rgb image," in European Conference on Computer Vision. Springer, 2022, pp. 125–145.
  25. S. Tripathi, A. Chatterjee, J.-C. Passy, H. Yi, D. Tzionas, and M. J. Black, "Deco: Dense estimation of 3d human-scene contact in the wild," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8001–8013.
  26. L. Müller, V. Ye, G. Pavlakos, M. Black, and A. Kanazawa, "Generative proxemics: A prior for 3d social interaction from images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9687–9697.
  27. Q. Fang, Y. Fan, Y. Li, J. Dong, D. Wu, W. Zhang, and K. Chen, "Capturing closely interacted two-person motions with reaction priors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 655–665.
  28. Z. Shen, Z. Cen, S. Peng, Q. Shuai, H. Bao, and X. Zhou, "Learning human mesh recovery in 3d scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17038–17047.
  29. S. Zhang, Q. Ma, Y. Zhang, S. Aliakbarian, D. Cosker, and S. Tang, "Probabilistic human mesh recovery in 3d scenes from egocentric views," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7989–8000.
  30. Q. Lei, B. Wang, and R. Tan, "Ez-hoi: Vlm adaptation via guided prompt learning for zero-shot hoi detection," Advances in Neural Information Processing Systems, vol. 37, pp. 55831–55857, 2024.
  31. P. Patel and M. J. Black, "Camerahmr: Aligning people with perspective," arXiv preprint arXiv:2411.08128, 2024.
  32. X. Hao, H. Li, J. Sun, L. Wang, and J. Fan, "A twist representation and shape refinement method for human mesh recovery," IEEE Transactions on Multimedia, vol. 27, pp. 1821–1834, 2025.
  33. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  34. X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
  35. S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, "Dab-detr: Dynamic anchor boxes are better queries for detr," arXiv preprint arXiv:2201.12329, 2022.
  36. Y. Wang, Y. Sun, P. Patel, K. Daniilidis, M. J. Black, and M. Kocabas, "Prompthmr: Promptable human mesh recovery," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1148–1159.
  37. M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Three-dimensional reconstruction of human interactions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7214–7223.
  38. Y. Yin, C. Guo, M. Kaufmann, J. J. Zarate, J. Song, and O. Hilliges, "Hi4d: 4d instance segmentation of close human interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17016–17027.
  39. B. Xu, J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, "Interact as you intend: Intention-driven human-object interaction detection," IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1423–1432, 2020.
  40. F. Z. Zhang, D. Campbell, and S. Gould, "Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20104–20112.
  41. T. Lei, F. Caba, Q. Chen, H. Jin, Y. Peng, and Y. Liu, "Efficient adaptive human-object interaction detection with concept-guided memory," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6480–6490.
  42. S. Fang, Z. Lin, K. Yan, J. Li, X. Lin, and R. Ji, "Hodn: Disentangling human-object feature for hoi detection," IEEE Transactions on Multimedia, vol. 26, pp. 3125–3136, 2024.
  43. Y. Mao, J. Deng, W. Zhou, L. Li, Y. Fang, and H. Li, "Clip4hoi: Towards adapting clip for practical zero-shot hoi detection," Advances in Neural Information Processing Systems, vol. 36, pp. 45895–45906, 2023.
  44. D. Chen, D. Kong, J. Gao, J. Li, Q. Li, and B. Yin, "Ask-hoi: Affordance-scene knowledge prompting for human-object interaction detection," IEEE Transactions on Multimedia, vol. 28, pp. 742–756, 2025.
  45. Y. Yuan, U. Iqbal, P. Molchanov, K. Kitani, and J. Kautz, "Glamr: Global occlusion-aware human mesh recovery with dynamic cameras," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11038–11049.
  46. V. Ye, G. Pavlakos, J. Malik, and A. Kanazawa, "Decoupling human and camera motion from videos in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21222–21232.
  47. S. Shin, J. Kim, E. Halilaj, and M. J. Black, "Wham: Reconstructing world-grounded humans with accurate 3d motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2070–2080.
  48. Y. Zhao, T. Y. Wang, B. Raj, M. Xu, J. Yang, and C.-H. P. Huang, "Synergistic global-space camera and human reconstruction from videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1216–1226.
  49. H. W. Kuhn, "The hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  50. H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
  51. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
  52. M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "Dinov2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
  53. Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, "Learning to detect human-object interactions," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 381–389.
  54. P. Patel, C.-H. P. Huang, J. Tesch, D. T. Hoffmann, S. Tripathi, and M. J. Black, "Agora: Avatars in geography optimized for regression analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13468–13478.
  55. M. J. Black, P. Patel, J. Tesch, and J. Yang, "Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8726–8737.
  56. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision – ECCV 2014. Springer, 2014, pp. 740–755.
  57. M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
  58. J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, "Crowdpose: Efficient crowded scenes pose estimation and a new benchmark," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
  59. C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.
  60. I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
  61. W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis, "Coherent reconstruction of multiple humans from a single image," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5579–5588.
  62. H. Choi, G. Moon, J. Park, and K. M. Lee, "Learning to estimate robust 3d human mesh from in-the-wild crowded scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1475–1484.
  63. Y. Sun, Q. Bao, W. Liu, T. Mei, and M. J. Black, "Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8856–8866.
  64. B. Huang, J. Ju, Z. Li, and Y. Wang, "Reconstructing groups of people with hypergraph relational reasoning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14873–14883.
  65. B. Huang, C. Li, C. Xu, D. Lu, J. Chen, Y. Wang, and G. H. Lee, "Reconstructing close human interaction with appearance and proxemics reasoning," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17475–17485.
  66. M. Fieraru, M. Zanfir, T. Szente, E. Bazavan, V. Olaru, and C. Sminchisescu, "Remips: Physically consistent 3d reconstruction of multiple interacting people under weak supervision," Advances in Neural Information Processing Systems, vol. 34, pp. 19385–19397, 2021.