pith · machine review for the scientific record

arxiv: 2605.04554 · v2 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links


InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multi-person human mesh recovery · human-object interaction · pose estimation · shape estimation · DETR framework · interaction-aware modeling · 3D reconstruction

The pith

InterMesh improves multi-person human mesh recovery by explicitly adding structured interaction semantics from a detector into query features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard DETR-style methods for recovering 3D meshes of multiple people only capture interactions implicitly through self-attention. InterMesh instead feeds explicit human-object and human-human interaction signals from a dedicated detector into two lightweight modules that update the person queries. These modules enrich the features used for pose and shape prediction without changing the core architecture much. Experiments on five datasets confirm lower error rates, especially where people touch or use objects. A reader would care because many real scenes involve such interactions, and better meshes support downstream uses like robotics and augmented reality.

Core claim

InterMesh explicitly incorporates human-environment interaction information into the human mesh recovery pipeline by leveraging a human-object interaction detector to enrich query representations with structured interaction semantics. Two lightweight modules, the Contextual Interaction Encoder and the Interaction-Guided Refiner, integrate these features into existing HMR architectures with minimal overhead for more accurate pose and shape estimation.

What carries the argument

Contextual Interaction Encoder and Interaction-Guided Refiner modules that integrate structured interaction semantics from a human-object interaction detector into the person query representations.
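To make the data flow concrete, here is a minimal sketch of how such a fusion step could look in PyTorch. The two module names come from the paper; every internal detail (tensor shapes, attention layout, the `hoi_feats` input) is an assumption made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of the interaction-aware query update.
# Module names follow the paper; all internals are assumptions.
import torch
import torch.nn as nn

class ContextualInteractionEncoder(nn.Module):
    """Encodes HOI detector outputs (e.g. embeddings of <human, verb, object>
    triplets) into a set of interaction tokens."""
    def __init__(self, hoi_dim=256, d_model=256, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(hoi_dim, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hoi_feats):                        # (B, num_interactions, hoi_dim)
        tokens = self.proj(hoi_feats)
        ctx, _ = self.self_attn(tokens, tokens, tokens)  # tokens exchange context
        return self.norm(tokens + ctx)                   # (B, num_interactions, d_model)

class InteractionGuidedRefiner(nn.Module):
    """Refines DETR person queries by cross-attending to interaction tokens."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, person_queries, interaction_tokens):
        upd, _ = self.cross_attn(person_queries, interaction_tokens, interaction_tokens)
        return self.norm(person_queries + upd)           # residual: enrich, don't replace

# Usage inside one decoder layer; refined queries then feed the pose/shape heads.
B, N, M, D = 2, 10, 5, 256                  # batch, people, interactions, width
queries = torch.randn(B, N, D)              # person queries after self-attention
hoi_feats = torch.randn(B, M, D)            # detector output embeddings (assumed shape)
refined = InteractionGuidedRefiner(D)(queries, ContextualInteractionEncoder(D)(hoi_feats))
```

The residual update is the load-bearing design choice in this sketch: interaction context enriches the existing queries without replacing them or disturbing the rest of the decoder, which would be consistent with the paper's claim of minimal overhead.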

If this is right

  • Multi-person mesh estimates become more accurate in scenes with object contacts and group interactions, as measured by MPJPE reductions of 9.9 percent on CMU Panoptic and 8.2 percent on Hi4D (the metric and the percent-reduction arithmetic are spelled out in the sketch after this list).
  • Existing DETR-based human mesh recovery pipelines can incorporate explicit interaction cues via small added modules without full retraining.
  • Performance gains appear across varied datasets including 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D when interaction complexity is high.
  • Pose and shape outputs benefit from relational semantics beyond what implicit attention alone provides.
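MPJPE, the metric behind those percentages, is the mean Euclidean distance between predicted and ground-truth 3D joints, conventionally reported in millimetres after root-joint alignment. A minimal sketch with illustrative numbers only:

```python
# Sketch of the MPJPE metric and the percent-reduction arithmetic quoted above.
# The joint arrays and the 50.0 mm baseline are illustrative, not paper values.
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint position error in mm after root-joint alignment.
    pred, gt: (num_people, num_joints, 3) arrays of 3D joint positions in mm."""
    pred = pred - pred[:, root_idx:root_idx + 1]   # center each person at the root joint
    gt = gt - gt[:, root_idx:root_idx + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def percent_reduction(baseline, method):
    return 100.0 * (baseline - method) / baseline

# A baseline at 50.0 mm improved to 45.05 mm is a 9.9 percent reduction,
# the size of the gain reported on CMU Panoptic:
print(percent_reduction(50.0, 45.05))   # ~9.9
```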

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If interaction detectors continue to improve independently, mesh recovery accuracy would increase without retraining the mesh model itself.
  • The same enrichment pattern could be tested on single-person recovery or on tasks like action recognition that already use human meshes.
  • Real-time deployment would require the interaction detector to match the mesh model's speed so that total latency stays acceptable.
  • Similar explicit relational features might help other vision problems that currently rely only on self-attention for scene context.

Load-bearing premise

The human-object interaction detector supplies accurate, relevant interaction signals that can be fused directly to improve mesh estimates without introducing new errors.

What would settle it

Running the same pipeline on Hi4D or CMU Panoptic with the interaction detector outputs replaced by random or zeroed features, and observing no increase in MPJPE (or even a decrease) relative to the full model, would falsify the claim that the explicit semantics drive the accuracy gains.
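A hedged sketch of that test in code; `model`, the data loader, and the `hoi_feats` keyword are hypothetical stand-ins for whatever interface the released code exposes:

```python
# Ablation sketch: swap real HOI detector features for zeroed or random
# tensors and compare average MPJPE. All interfaces here are assumed.
import torch

def eval_mpjpe(model, loader, hoi_source):
    """Average per-batch MPJPE with a pluggable HOI feature source."""
    total, batches = 0.0, 0
    model.eval()
    with torch.no_grad():
        for images, gt_joints in loader:
            pred_joints = model(images, hoi_feats=hoi_source(images))
            total += torch.linalg.norm(pred_joints - gt_joints, dim=-1).mean().item()
            batches += 1
    return total / batches

def run_ablation(model, loader, hoi_detector):
    real = eval_mpjpe(model, loader, hoi_detector)
    zeroed = eval_mpjpe(model, loader, lambda im: torch.zeros_like(hoi_detector(im)))
    rand = eval_mpjpe(model, loader, lambda im: torch.randn_like(hoi_detector(im)))
    # If zeroed or random inputs leave MPJPE unchanged, the explicit
    # interaction semantics are not what drives the reported gains.
    return {"real": real, "zeroed": zeroed, "random": rand}
```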

Figures

Figures reproduced from arXiv: 2605.04554 by Chenyi Guo, Ji Wu, Kaili Zheng, Kaiwen Wang, Xun Zhu.

Figure 1: Overview of InterMesh. Unlike prior methods that rely solely on implicit modeling of inter-human relationships, …
Figure 2: Architecture of a single decoder layer of InterMesh. Each layer takes as input human queries updated by self-attention …
Figure 3: Qualitative comparison on 3DPW. (a) 3DPW (b) Hi4D
Figure 3: Illustration of the attention masks used for batched training …
Figure 4: Per-vertex error comparison on 3DPW and Hi4D.
Figure 4: Qualitative comparison on 3DPW; panel labels Image, SAT-HMR, InterMesh for the interactions (person, carry, person), (person, hold, person), and (person, hug, person).
Figure 5: Ablation study on the starting decoder layer for …
Figure 6: Example of attention masks for (a) MHSA in Con…
Figure 6: Visualization results on CMU Panoptic.
Figure 7: Qualitative comparison on Hi4D.
Figure 7: Visualization results on MuPoTS.
Figure 8: More visualization results on 3DPW.
Figure 8: Visualization results on in-the-wild samples.
Figure 9: Visualization results on CMU Panoptic.
Figure 9: Per-vertex error comparison on 3DPW and Hi4D. Green regions …
Figure 10: Visualization results on MuPoTS.
Figure 10: Ablation study on the starting decoder layer for incorporating human …
Figure 11: Visualization results on in-the-wild samples.
original abstract

Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces InterMesh, an end-to-end multi-person human mesh recovery framework that augments DETR-style query representations with explicit human-object interaction semantics extracted from an off-the-shelf HOI detector. It proposes lightweight Contextual Interaction Encoder and Interaction-Guided Refiner modules to fuse these features into existing HMR pipelines and reports consistent gains over prior methods on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D, including 9.9% MPJPE reduction on CMU Panoptic and 8.2% on Hi4D, while releasing code.

Significance. If the gains prove robust, the work would usefully extend implicit self-attention baselines by injecting structured interaction priors, with particular value in crowded or object-rich scenes. The public release of code and models is a clear strength that supports reproducibility.

major comments (1)
  1. [Experiments] Experiments section: the central claim that the HOI detector supplies accurate, noise-free semantics that the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, nor is error propagation quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'remarkable improvements' is subjective; replace with precise percentage gains and baseline names for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the experimental validation of the HOI detector's role. We agree that additional analysis is warranted and will incorporate the suggested ablation in the revised manuscript.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that the HOI detector supplies accurate, noise-free semantics that the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, nor is error propagation quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.

    Authors: We agree that testing robustness to HOI detector quality strengthens the claims. In the revision we will add an ablation that replaces the off-the-shelf HOI detector outputs with controlled noisy versions (random labels, reduced-accuracy oracles, and simulated occlusion-induced errors) and report the resulting MPJPE changes on CMU Panoptic and Hi4D. This will quantify error propagation through the Contextual Interaction Encoder and Interaction-Guided Refiner. revision: yes
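One possible shape for those controlled corruptions, sketched below. The corruption modes, the 30 percent default rate, and the 117-verb default (the HICO-DET verb count) are assumptions, not the authors' stated protocol:

```python
# Hypothetical corruption of HOI detections for the promised ablation:
# triplets are (human_box, obj_box, verb_id); modes and rates are assumed.
import random

def corrupt_hoi_detections(detections, mode, rate=0.3, num_verbs=117, seed=0):
    """Return a corrupted copy of a list of HOI triplets.

    mode = "random_labels": re-draw a fraction of verb labels uniformly.
    mode = "dropped": delete a fraction of detections (occlusion-style misses).
    """
    rng = random.Random(seed)
    corrupted = []
    for human_box, obj_box, verb_id in detections:
        if mode == "dropped" and rng.random() < rate:
            continue                                # simulate a missed detection
        if mode == "random_labels" and rng.random() < rate:
            verb_id = rng.randrange(num_verbs)      # uniform over the verb set
        corrupted.append((human_box, obj_box, verb_id))
    return corrupted
```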

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with external detector inputs

full rationale

The paper describes an end-to-end architecture that feeds outputs from a pre-existing human-object interaction detector into Contextual Interaction Encoder and Interaction-Guided Refiner modules within a DETR-style HMR backbone. No equations, fitted parameters, or self-citation chains are presented that reduce the claimed MPJPE improvements to quantities defined inside the same model. Validation occurs on independent external benchmarks (3DPW, MuPoTS, CMU Panoptic, Hi4D, CHI3D) with reported gains over prior SOTA, satisfying the criteria for a self-contained empirical contribution without load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a pre-trained human-object interaction detector supplies useful structured features and that standard end-to-end training on the listed datasets will integrate those features effectively.

free parameters (1)
  • interaction integration weights
    Learned parameters inside the Contextual Interaction Encoder and Interaction-Guided Refiner that control how detector outputs are fused.
axioms (1)
  • domain assumption: The human-object interaction detector produces reliable and task-relevant semantics
    The pipeline depends on this external detector to enrich queries; its accuracy is taken as given.

pith-pipeline@v0.9.0 · 5531 in / 1372 out tokens · 82537 ms · 2026-05-15T06:26:10.329284+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 4 internal anchors

  1. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "Smpl: A skinned multi-person linear model," ACM Transactions on Graphics, vol. 34, no. 6, 2015.
  2. C.-Y. Weng, B. Curless, and I. Kemelmacher-Shlizerman, "Photo wake-up: 3d character animation from a single photo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5908–5917.
  3. J. Wang, Y. Yuan, Z. Luo, K. Xie, D. Lin, U. Iqbal, S. Fidler, and S. Khamis, "Learning human dynamics in autonomous driving scenarios," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20796–20806.
  4. J. Liu, N. Saquib, C. Zhutian, R. H. Kazi, L.-Y. Wei, H. Fu, and C.-L. Tai, "Posecoach: A customizable analysis and visualization system for video-based running coaching," IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 7, pp. 3180–3195, 2022.
  5. Y. Xiu, J. Yang, D. Tzionas, and M. J. Black, "Icon: Implicit clothed humans obtained from normals," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 13286–13296.
  6. N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, "Learning to reconstruct 3d human pose and shape via model-fitting in the loop," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2252–2261.
  7. M. Kocabas, N. Athanasiou, and M. J. Black, "Vibe: Video inference for human body pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5253–5263.
  8. M. Kocabas, C.-H. P. Huang, O. Hilliges, and M. J. Black, "Pare: Part attention regressor for 3d human body estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11127–11137.
  9. Y. Xue, J. Chen, Y. Zhang, C. Yu, H. Ma, and H. Ma, "3d human mesh reconstruction by learning to sample joint adaptive tokens for transformers," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6765–6773.
  10. Z. Li, J. Liu, Z. Zhang, S. Xu, and Y. Yan, "Cliff: Carrying location information in full frames into human pose and shape estimation," in European Conference on Computer Vision. Springer, 2022, pp. 590–606.
  11. S. Goel, G. Pavlakos, J. Rajasegaran, A. Kanazawa, and J. Malik, "Humans in 4d: Reconstructing and tracking humans with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14783–14794.
  12. Z. Cai, W. Yin, A. Zeng, C. Wei, Q. Sun, W. Yanjun, H. E. Pang, H. Mei, M. Zhang, L. Zhang et al., "Smpler-x: Scaling up expressive human pose and shape estimation," Advances in Neural Information Processing Systems, vol. 36, pp. 11454–11468, 2023.
  13. J. Zhang, D. Yu, J. H. Liew, X. Nie, and J. Feng, "Body meshes as points," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 546–556.
  14. Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, "Monocular, one-stage, regression of multiple 3d people," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11179–11188.
  15. Y. Sun, W. Liu, Q. Bao, Y. Fu, T. Mei, and M. J. Black, "Putting people in their place: Monocular regression of 3d people in depth," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13243–13252.
  16. Z. Qiu, Q. Yang, J. Wang, H. Feng, J. Han, E. Ding, C. Xu, D. Fu, and J. Wang, "Psvt: End-to-end multi-person 3d pose and shape estimation with progressive video transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21254–21263.
  17. Q. Sun, Y. Wang, A. Zeng, W. Yin, C. Wei, W. Wang, H. Mei, C.-S. Leung, Z. Liu, L. Yang et al., "Aios: All-in-one-stage expressive human pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1834–1843.
  18. F. Baradel, M. Armando, S. Galaaoui, R. Brégier, P. Weinzaepfel, G. Rogez, and T. Lucas, "Multi-hmr: Multi-person whole-body human mesh recovery in a single shot," in European Conference on Computer Vision. Springer, 2024, pp. 202–218.
  19. C. Su, X. Ma, J. Su, and Y. Wang, "Sat-hmr: Real-time multi-person 3d mesh estimation via scale-adaptive tokens," arXiv preprint arXiv:2411.19824, 2024.
  20. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
  21. M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Reconstructing three-dimensional models of interacting humans," arXiv preprint arXiv:2308.01854, 2023.
  22. B. Huang, C. Li, C. Xu, L. Pan, Y. Wang, and G. H. Lee, "Closely interactive human reconstruction with proxemics and physics-guided adaption," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1011–1021.
  23. M. Hassan, P. Ghosh, J. Tesch, D. Tzionas, and M. J. Black, "Populating 3d scenes by learning human-scene interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14708–14718.
  24. X. Xie, B. L. Bhatnagar, and G. Pons-Moll, "Chore: Contact, human and object reconstruction from a single rgb image," in European Conference on Computer Vision. Springer, 2022, pp. 125–145.
  25. S. Tripathi, A. Chatterjee, J.-C. Passy, H. Yi, D. Tzionas, and M. J. Black, "Deco: Dense estimation of 3d human-scene contact in the wild," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8001–8013.
  26. L. Müller, V. Ye, G. Pavlakos, M. Black, and A. Kanazawa, "Generative proxemics: A prior for 3d social interaction from images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9687–9697.
  27. Q. Fang, Y. Fan, Y. Li, J. Dong, D. Wu, W. Zhang, and K. Chen, "Capturing closely interacted two-person motions with reaction priors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 655–665.
  28. Z. Shen, Z. Cen, S. Peng, Q. Shuai, H. Bao, and X. Zhou, "Learning human mesh recovery in 3d scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17038–17047.
  29. S. Zhang, Q. Ma, Y. Zhang, S. Aliakbarian, D. Cosker, and S. Tang, "Probabilistic human mesh recovery in 3d scenes from egocentric views," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7989–8000.
  30. Q. Lei, B. Wang, and R. Tan, "Ez-hoi: Vlm adaptation via guided prompt learning for zero-shot hoi detection," Advances in Neural Information Processing Systems, vol. 37, pp. 55831–55857, 2024.
  31. P. Patel and M. J. Black, "Camerahmr: Aligning people with perspective," arXiv preprint arXiv:2411.08128, 2024.
  32. X. Hao, H. Li, J. Sun, L. Wang, and J. Fan, "A twist representation and shape refinement method for human mesh recovery," IEEE Transactions on Multimedia, vol. 27, pp. 1821–1834, 2025.
  33. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  34. X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable detr: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
  35. S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, "Dab-detr: Dynamic anchor boxes are better queries for detr," arXiv preprint arXiv:2201.12329, 2022.
  36. Y. Wang, Y. Sun, P. Patel, K. Daniilidis, M. J. Black, and M. Kocabas, "Prompthmr: Promptable human mesh recovery," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1148–1159.
  37. M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Three-dimensional reconstruction of human interactions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7214–7223.
  38. Y. Yin, C. Guo, M. Kaufmann, J. J. Zarate, J. Song, and O. Hilliges, "Hi4d: 4d instance segmentation of close human interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17016–17027.
  39. B. Xu, J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, "Interact as you intend: Intention-driven human-object interaction detection," IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1423–1432, 2020.
  40. F. Z. Zhang, D. Campbell, and S. Gould, "Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20104–20112.
  41. T. Lei, F. Caba, Q. Chen, H. Jin, Y. Peng, and Y. Liu, "Efficient adaptive human-object interaction detection with concept-guided memory," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6480–6490.
  42. S. Fang, Z. Lin, K. Yan, J. Li, X. Lin, and R. Ji, "Hodn: Disentangling human-object feature for hoi detection," IEEE Transactions on Multimedia, vol. 26, pp. 3125–3136, 2024.
  43. Y. Mao, J. Deng, W. Zhou, L. Li, Y. Fang, and H. Li, "Clip4hoi: Towards adapting clip for practical zero-shot hoi detection," Advances in Neural Information Processing Systems, vol. 36, pp. 45895–45906, 2023.
  44. D. Chen, D. Kong, J. Gao, J. Li, Q. Li, and B. Yin, "Ask-hoi: Affordance-scene knowledge prompting for human-object interaction detection," IEEE Transactions on Multimedia, vol. 28, pp. 742–756, 2025.
  45. Y. Yuan, U. Iqbal, P. Molchanov, K. Kitani, and J. Kautz, "Glamr: Global occlusion-aware human mesh recovery with dynamic cameras," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11038–11049.
  46. V. Ye, G. Pavlakos, J. Malik, and A. Kanazawa, "Decoupling human and camera motion from videos in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21222–21232.
  47. S. Shin, J. Kim, E. Halilaj, and M. J. Black, "Wham: Reconstructing world-grounded humans with accurate 3d motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2070–2080.
  48. Y. Zhao, T. Y. Wang, B. Raj, M. Xu, J. Yang, and C.-H. P. Huang, "Synergistic global-space camera and human reconstruction from videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1216–1226.
  49. H. W. Kuhn, "The hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  50. H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
  51. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
  52. M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "Dinov2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
  53. Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, "Learning to detect human-object interactions," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 381–389.
  54. P. Patel, C.-H. P. Huang, J. Tesch, D. T. Hoffmann, S. Tripathi, and M. J. Black, "Agora: Avatars in geography optimized for regression analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13468–13478.
  55. M. J. Black, P. Patel, J. Tesch, and J. Yang, "Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8726–8737.
  56. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision – ECCV 2014. Springer, 2014, pp. 740–755.
  57. M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
  58. J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, "Crowdpose: Efficient crowded scenes pose estimation and a new benchmark," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
  59. C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.
  60. I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
  61. W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis, "Coherent reconstruction of multiple humans from a single image," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5579–5588.
  62. H. Choi, G. Moon, J. Park, and K. M. Lee, "Learning to estimate robust 3d human mesh from in-the-wild crowded scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1475–1484.
  63. Y. Sun, Q. Bao, W. Liu, T. Mei, and M. J. Black, "Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8856–8866.
  64. B. Huang, J. Ju, Z. Li, and Y. Wang, "Reconstructing groups of people with hypergraph relational reasoning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14873–14883.
  65. B. Huang, C. Li, C. Xu, D. Lu, J. Chen, Y. Wang, and G. H. Lee, "Reconstructing close human interaction with appearance and proxemics reasoning," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17475–17485.
  66. M. Fieraru, M. Zanfir, T. Szente, E. Bazavan, V. Olaru, and C. Sminchisescu, "Remips: Physically consistent 3d reconstruction of multiple interacting people under weak supervision," Advances in Neural Information Processing Systems, vol. 34, pp. 19385–19397, 2021.