Recognition: 2 theorem links
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery
Pith reviewed 2026-05-15 06:26 UTC · model grok-4.3
The pith
InterMesh improves multi-person human mesh recovery by explicitly adding structured interaction semantics from a detector into query features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InterMesh explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. It leverages a human-object interaction detector to enrich query representations with structured interaction semantics, and two lightweight modules, a Contextual Interaction Encoder and an Interaction-Guided Refiner, integrate these features into existing HMR architectures with minimal overhead, yielding more accurate pose and shape estimation.
What carries the argument
Contextual Interaction Encoder and Interaction-Guided Refiner modules that integrate structured interaction semantics from a human-object interaction detector into the person query representations.
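The review does not reproduce the modules' internals, but the described mechanism, enriching DETR-style person queries with detector-supplied interaction tokens, is essentially a cross-attention fusion. A minimal sketch under that assumption (the function name `refine_queries` and all dimensions are invented for illustration; the paper's actual layer design may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_queries(person_queries, hoi_features):
    """Cross-attend person queries to HOI-detector features and add the
    result residually -- one plausible reading of an Interaction-Guided
    Refiner, not the paper's actual implementation."""
    d = person_queries.shape[-1]
    scores = person_queries @ hoi_features.T / np.sqrt(d)  # (Q, H) affinities
    attn = softmax(scores, axis=-1)                        # rows sum to 1
    return person_queries + attn @ hoi_features            # residual fusion

queries = rng.standard_normal((8, 64))  # 8 person queries, dim 64
hoi = rng.standard_normal((5, 64))      # 5 interaction tokens from a detector
refined = refine_queries(queries, hoi)
assert refined.shape == queries.shape
```

The residual form matters: if the interaction tokens are uninformative, the attention output can be learned away without disturbing the original queries, which is consistent with the "minimal overhead" claim.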
If this is right
- Multi-person mesh estimates become more accurate in scenes with object contacts and group interactions, as measured by MPJPE reductions of 9.9 percent on CMU Panoptic and 8.2 percent on Hi4D.
- Existing DETR-based human mesh recovery pipelines can incorporate explicit interaction cues via small added modules without full retraining.
- Performance gains appear across varied datasets including 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D when interaction complexity is high.
- Pose and shape outputs benefit from relational semantics beyond what implicit attention alone provides.
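The headline numbers above are MPJPE (Mean Per-Joint Position Error) reductions. As a reminder of what the metric and the percentage claims mean, here is a small sketch; joint counts and root-alignment conventions vary between benchmarks, so this is illustrative only:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints, here in millimetres."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))                 # 17 joints at the origin
pred = gt + np.array([3.0, 0.0, 4.0])  # every joint off by a 3-4-5 mm offset
assert np.isclose(mpjpe(pred, gt), 5.0)

# A "9.9 percent reduction" compares two models' MPJPE (values hypothetical):
baseline, ours = 100.0, 90.1
reduction = 100 * (baseline - ours) / baseline
```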
Where Pith is reading between the lines
- If interaction detectors continue to improve independently, mesh recovery accuracy would increase without retraining the mesh model itself.
- The same enrichment pattern could be tested on single-person recovery or on tasks like action recognition that already use human meshes.
- Real-time deployment would require the interaction detector to match the mesh model's speed so that total latency stays acceptable.
- Similar explicit relational features might help other vision problems that currently rely only on self-attention for scene context.
Load-bearing premise
The human-object interaction detector supplies accurate, relevant interaction signals that can be fused directly to improve mesh estimates without introducing new errors.
What would settle it
Running the same pipeline on Hi4D or CMU Panoptic with the interaction detector outputs replaced by random or zeroed features, and observing that MPJPE does not worsen (or even improves), would falsify the claim that the explicit semantics drive the accuracy gains.
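The falsification test above can be sketched as a feature-ablation harness. Everything here is a toy stand-in: `mesh_error` and `signal_strength` are invented for illustration and do not correspond to anything in the paper; the point is only the comparison structure (real features vs. a zeroed control):

```python
import numpy as np

rng = np.random.default_rng(0)

def mesh_error(hoi_features, signal_strength=10.0):
    """Toy stand-in for the mesh-recovery pipeline: the MPJPE-like error
    shrinks when the HOI features carry real signal. The actual model is
    far richer; this only shows the shape of the ablation."""
    return 100.0 - signal_strength * np.abs(hoi_features).mean()

real_feats = rng.normal(1.0, 0.1, size=(5, 64))  # informative detector output
zero_feats = np.zeros_like(real_feats)           # ablated control

err_real = mesh_error(real_feats)
err_zero = mesh_error(zero_feats)
# If the explicit semantics drive the gains, zeroing them must hurt:
assert err_real < err_zero
```

In the real experiment the claim survives only if the ablated run degrades; equal or better error under zeroed features would mean the modules were not using the interaction semantics.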
Original abstract
Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InterMesh, an end-to-end multi-person human mesh recovery framework that augments DETR-style query representations with explicit human-object interaction semantics extracted from an off-the-shelf HOI detector. It proposes lightweight Contextual Interaction Encoder and Interaction-Guided Refiner modules to fuse these features into existing HMR pipelines and reports consistent gains over prior methods on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D, including 9.9% MPJPE reduction on CMU Panoptic and 8.2% on Hi4D, while releasing code.
Significance. If the gains prove robust, the work would usefully extend implicit self-attention baselines by injecting structured interaction priors, with particular value in crowded or object-rich scenes. The public release of code and models is a clear strength that supports reproducibility.
major comments (1)
- [Experiments] The central claim, that the HOI detector supplies accurate, noise-free semantics the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit, rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, and error propagation is not quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.
minor comments (1)
- [Abstract] The phrase 'remarkable improvements' is subjective; replace it with precise percentage gains and baseline names for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the experimental validation of the HOI detector's role. We agree that additional analysis is warranted and will incorporate the suggested ablation in the revised manuscript.
Point-by-point responses
-
Referee: [Experiments] The central claim, that the HOI detector supplies accurate, noise-free semantics the Contextual Interaction Encoder and Interaction-Guided Refiner can directly exploit, rests on an untested assumption. No ablation replaces the detector with noisy, weaker, or random oracles, and error propagation is not quantified; the reported MPJPE reductions could therefore be sensitive to detector quality in occluded or crowded scenes.
Authors: We agree that testing robustness to HOI detector quality strengthens the claims. In the revision we will add an ablation that replaces the off-the-shelf HOI detector outputs with controlled noisy versions (random labels, reduced-accuracy oracles, and simulated occlusion-induced errors) and report the resulting MPJPE changes on CMU Panoptic and Hi4D. This will quantify error propagation through the Contextual Interaction Encoder and Interaction-Guided Refiner. Revision: yes.
Circularity Check
No significant circularity; empirical pipeline with external detector inputs
full rationale
The paper describes an end-to-end architecture that feeds outputs from a pre-existing human-object interaction detector into Contextual Interaction Encoder and Interaction-Guided Refiner modules within a DETR-style HMR backbone. No equations, fitted parameters, or self-citation chains are presented that reduce the claimed MPJPE improvements to quantities defined inside the same model. Validation occurs on independent external benchmarks (3DPW, MuPoTS, CMU Panoptic, Hi4D, CHI3D) with reported gains over prior SOTA, satisfying the criteria for a self-contained empirical contribution without load-bearing self-definition or renaming of known results.
Axiom & Free-Parameter Ledger
free parameters (1)
- interaction integration weights
axioms (1)
- domain assumption: the human-object interaction detector produces reliable and task-relevant semantics
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
Passage: "By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics... Contextual Interaction Encoder and Interaction-Guided Refiner"
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
Passage: "reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Transactions on Graphics, vol. 34, no. 6, 2015.
- [2] C.-Y. Weng, B. Curless, and I. Kemelmacher-Shlizerman, "Photo wake-up: 3D character animation from a single photo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5908–5917.
- [3] J. Wang, Y. Yuan, Z. Luo, K. Xie, D. Lin, U. Iqbal, S. Fidler, and S. Khamis, "Learning human dynamics in autonomous driving scenarios," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20796–20806.
- [4] J. Liu, N. Saquib, C. Zhutian, R. H. Kazi, L.-Y. Wei, H. Fu, and C.-L. Tai, "PoseCoach: A customizable analysis and visualization system for video-based running coaching," IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 7, pp. 3180–3195, 2022.
- [5] Y. Xiu, J. Yang, D. Tzionas, and M. J. Black, "ICON: Implicit clothed humans obtained from normals," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 13286–13296.
- [6] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, "Learning to reconstruct 3D human pose and shape via model-fitting in the loop," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2252–2261.
- [7] M. Kocabas, N. Athanasiou, and M. J. Black, "VIBE: Video inference for human body pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5253–5263.
- [8] M. Kocabas, C.-H. P. Huang, O. Hilliges, and M. J. Black, "PARE: Part attention regressor for 3D human body estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11127–11137.
- [9] Y. Xue, J. Chen, Y. Zhang, C. Yu, H. Ma, and H. Ma, "3D human mesh reconstruction by learning to sample joint adaptive tokens for transformers," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6765–6773.
- [10] Z. Li, J. Liu, Z. Zhang, S. Xu, and Y. Yan, "CLIFF: Carrying location information in full frames into human pose and shape estimation," in European Conference on Computer Vision. Springer, 2022, pp. 590–606.
- [11] S. Goel, G. Pavlakos, J. Rajasegaran, A. Kanazawa, and J. Malik, "Humans in 4D: Reconstructing and tracking humans with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14783–14794.
- [12] Z. Cai, W. Yin, A. Zeng, C. Wei, Q. Sun, W. Yanjun, H. E. Pang, H. Mei, M. Zhang, L. Zhang et al., "SMPLer-X: Scaling up expressive human pose and shape estimation," Advances in Neural Information Processing Systems, vol. 36, pp. 11454–11468, 2023.
- [13] J. Zhang, D. Yu, J. H. Liew, X. Nie, and J. Feng, "Body meshes as points," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 546–556.
- [14] Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, "Monocular, one-stage, regression of multiple 3D people," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11179–11188.
- [15] Y. Sun, W. Liu, Q. Bao, Y. Fu, T. Mei, and M. J. Black, "Putting people in their place: Monocular regression of 3D people in depth," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13243–13252.
- [16] Z. Qiu, Q. Yang, J. Wang, H. Feng, J. Han, E. Ding, C. Xu, D. Fu, and J. Wang, "PSVT: End-to-end multi-person 3D pose and shape estimation with progressive video transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21254–21263.
- [17] Q. Sun, Y. Wang, A. Zeng, W. Yin, C. Wei, W. Wang, H. Mei, C.-S. Leung, Z. Liu, L. Yang et al., "AiOS: All-in-one-stage expressive human pose and shape estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1834–1843.
- [18] F. Baradel, M. Armando, S. Galaaoui, R. Brégier, P. Weinzaepfel, G. Rogez, and T. Lucas, "Multi-HMR: Multi-person whole-body human mesh recovery in a single shot," in European Conference on Computer Vision. Springer, 2024, pp. 202–218.
- [19] C. Su, X. Ma, J. Su, and Y. Wang, "SAT-HMR: Real-time multi-person 3D mesh estimation via scale-adaptive tokens," arXiv preprint arXiv:2411.19824, 2024.
- [20] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
- [21] M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Reconstructing three-dimensional models of interacting humans," arXiv preprint arXiv:2308.01854, 2023.
- [22] B. Huang, C. Li, C. Xu, L. Pan, Y. Wang, and G. H. Lee, "Closely interactive human reconstruction with proxemics and physics-guided adaption," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1011–1021.
- [23] M. Hassan, P. Ghosh, J. Tesch, D. Tzionas, and M. J. Black, "Populating 3D scenes by learning human-scene interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14708–14718.
- [24] X. Xie, B. L. Bhatnagar, and G. Pons-Moll, "CHORE: Contact, human and object reconstruction from a single RGB image," in European Conference on Computer Vision. Springer, 2022, pp. 125–145.
- [25] S. Tripathi, A. Chatterjee, J.-C. Passy, H. Yi, D. Tzionas, and M. J. Black, "DECO: Dense estimation of 3D human-scene contact in the wild," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8001–8013.
- [26] L. Müller, V. Ye, G. Pavlakos, M. Black, and A. Kanazawa, "Generative proxemics: A prior for 3D social interaction from images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9687–9697.
- [27] Q. Fang, Y. Fan, Y. Li, J. Dong, D. Wu, W. Zhang, and K. Chen, "Capturing closely interacted two-person motions with reaction priors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 655–665.
- [28] Z. Shen, Z. Cen, S. Peng, Q. Shuai, H. Bao, and X. Zhou, "Learning human mesh recovery in 3D scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17038–17047.
- [29] S. Zhang, Q. Ma, Y. Zhang, S. Aliakbarian, D. Cosker, and S. Tang, "Probabilistic human mesh recovery in 3D scenes from egocentric views," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7989–8000.
- [30] Q. Lei, B. Wang, and R. Tan, "EZ-HOI: VLM adaptation via guided prompt learning for zero-shot HOI detection," Advances in Neural Information Processing Systems, vol. 37, pp. 55831–55857, 2024.
- [31] P. Patel and M. J. Black, "CameraHMR: Aligning people with perspective," arXiv preprint arXiv:2411.08128, 2024.
- [32] X. Hao, H. Li, J. Sun, L. Wang, and J. Fan, "A twist representation and shape refinement method for human mesh recovery," IEEE Transactions on Multimedia, vol. 27, pp. 1821–1834, 2025.
- [33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [34] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
- [35] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, "DAB-DETR: Dynamic anchor boxes are better queries for DETR," arXiv preprint arXiv:2201.12329, 2022.
- [36] Y. Wang, Y. Sun, P. Patel, K. Daniilidis, M. J. Black, and M. Kocabas, "PromptHMR: Promptable human mesh recovery," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1148–1159.
- [37] M. Fieraru, M. Zanfir, E. Oneata, A.-I. Popa, V. Olaru, and C. Sminchisescu, "Three-dimensional reconstruction of human interactions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7214–7223.
- [38] Y. Yin, C. Guo, M. Kaufmann, J. J. Zarate, J. Song, and O. Hilliges, "Hi4D: 4D instance segmentation of close human interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17016–17027.
- [39] B. Xu, J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, "Interact as you intend: Intention-driven human-object interaction detection," IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1423–1432, 2020.
- [40] F. Z. Zhang, D. Campbell, and S. Gould, "Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20104–20112.
- [41] T. Lei, F. Caba, Q. Chen, H. Jin, Y. Peng, and Y. Liu, "Efficient adaptive human-object interaction detection with concept-guided memory," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6480–6490.
- [42] S. Fang, Z. Lin, K. Yan, J. Li, X. Lin, and R. Ji, "HODN: Disentangling human-object feature for HOI detection," IEEE Transactions on Multimedia, vol. 26, pp. 3125–3136, 2024.
- [43] Y. Mao, J. Deng, W. Zhou, L. Li, Y. Fang, and H. Li, "CLIP4HOI: Towards adapting CLIP for practical zero-shot HOI detection," Advances in Neural Information Processing Systems, vol. 36, pp. 45895–45906, 2023.
- [44] D. Chen, D. Kong, J. Gao, J. Li, Q. Li, and B. Yin, "ASK-HOI: Affordance-scene knowledge prompting for human-object interaction detection," IEEE Transactions on Multimedia, vol. 28, pp. 742–756, 2025.
- [45] Y. Yuan, U. Iqbal, P. Molchanov, K. Kitani, and J. Kautz, "GLAMR: Global occlusion-aware human mesh recovery with dynamic cameras," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11038–11049.
- [46] V. Ye, G. Pavlakos, J. Malik, and A. Kanazawa, "Decoupling human and camera motion from videos in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21222–21232.
- [47] S. Shin, J. Kim, E. Halilaj, and M. J. Black, "WHAM: Reconstructing world-grounded humans with accurate 3D motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2070–2080.
- [48] Y. Zhao, T. Y. Wang, B. Raj, M. Xu, J. Yang, and C.-H. P. Huang, "Synergistic global-space camera and human reconstruction from videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1216–1226.
- [49] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.
- [50] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
- [51] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
- [52] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
- [53] Y. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, "Learning to detect human-object interactions," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 381–389.
- [54] P. Patel, C.-H. P. Huang, J. Tesch, D. T. Hoffmann, S. Tripathi, and M. J. Black, "AGORA: Avatars in geography optimized for regression analysis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13468–13478.
- [55] M. J. Black, P. Patel, J. Tesch, and J. Yang, "BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8726–8737.
- [56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014. Springer, 2014, pp. 740–755.
- [57] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
- [58] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, "CrowdPose: Efficient crowded scenes pose estimation and a new benchmark," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
- [59] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.
- [60] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
- [61] W. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis, "Coherent reconstruction of multiple humans from a single image," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5579–5588.
- [62] H. Choi, G. Moon, J. Park, and K. M. Lee, "Learning to estimate robust 3D human mesh from in-the-wild crowded scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1475–1484.
- [63] Y. Sun, Q. Bao, W. Liu, T. Mei, and M. J. Black, "TRACE: 5D temporal regression of avatars with dynamic cameras in 3D environments," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8856–8866.
- [64] B. Huang, J. Ju, Z. Li, and Y. Wang, "Reconstructing groups of people with hypergraph relational reasoning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14873–14883.
- [65] B. Huang, C. Li, C. Xu, D. Lu, J. Chen, Y. Wang, and G. H. Lee, "Reconstructing close human interaction with appearance and proxemics reasoning," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17475–17485.
- [66] M. Fieraru, M. Zanfir, T. Szente, E. Bazavan, V. Olaru, and C. Sminchisescu, "REMIPS: Physically consistent 3D reconstruction of multiple interacting people under weak supervision," Advances in Neural Information Processing Systems, vol. 34, pp. 19385–19397, 2021.
discussion (0)