TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
A vision-language model localizes task-critical 3D human body parts for robots by turning positions into discrete tokens for next-token prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAIHRI is the first vision-language model tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts through 2D keypoint reasoning via next-token prediction, and seamlessly adapts to downstream tasks such as natural language control or global-space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts.
What carries the argument
Quantization of 3D keypoints into a finite interaction space together with 2D keypoint reasoning performed as next-token prediction inside a vision-language model.
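The abstract does not state the workspace bounds or token vocabulary size, so the following is only a minimal sketch of Pix2seq-style coordinate tokenization [5]: the `lo`/`hi` workspace bounds and `n_bins` are hypothetical values chosen for illustration, not the paper's settings.

```python
import numpy as np

def quantize_keypoints(xyz, lo=-2.0, hi=2.0, n_bins=1000):
    """Map continuous 3D keypoints (metres) to per-axis discrete token ids.

    lo/hi bound an assumed close-range workspace; n_bins is a hypothetical
    vocabulary size -- the paper states neither value.
    """
    xyz = np.clip(np.asarray(xyz, dtype=np.float64), lo, hi)
    ids = np.floor((xyz - lo) / (hi - lo) * n_bins).astype(np.int64)
    return np.minimum(ids, n_bins - 1)  # ids a VLM can emit autoregressively

def dequantize_tokens(ids, lo=-2.0, hi=2.0, n_bins=1000):
    """Decode token ids back to metric coordinates at bin centres."""
    w = (hi - lo) / n_bins
    return lo + (ids + 0.5) * w

kp = np.array([[0.31, -0.12, 0.87]])  # e.g. a wrist position in camera frame
tok = quantize_keypoints(kp)
rec = dequantize_tokens(tok)
# Round trip is exact up to half a bin width (here 4 m / 1000 bins = 4 mm).
assert np.all(np.abs(rec - kp) <= 0.002 + 1e-9)
```

The round trip makes the trade-off concrete: localization error can never be smaller than half a bin, so the achievable precision is set directly by the vocabulary size.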
If this is right
- Robots can use natural language commands to control actions that depend on exact locations of hands or other body parts.
- The localized keypoints can feed directly into full human mesh recovery in the robot's global coordinate frame.
- Estimation accuracy improves specifically on the body parts that matter most for safe physical contact rather than on average over the whole body.
- The same token-prediction pipeline supports adaptation to multiple HRI downstream tasks without retraining from scratch.
Where Pith is reading between the lines
- The same discretization and token-prediction pattern could be tested on other continuous 3D sensing problems such as object pose estimation in manipulation.
- Robots might operate with lower reliance on dedicated depth cameras if 2D reasoning inside large models proves sufficient for metric accuracy.
- Task-aware selection of which keypoints to localize could become a general principle for perception systems that must act under tight real-time budgets.
Load-bearing premise
Breaking continuous 3D body-part positions into a small number of discrete slots still lets the model recover their exact distances and locations from images taken by a robot's forward-facing camera.
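Whether discrete slots can preserve exact locations is, in the worst case, a pure geometry question: uniform bins of width w cost at most w/2 per axis, or √3·w/2 in 3D. The sketch below assumes a 4 m workspace span per axis, which is illustrative — the paper reports neither span nor bin count.

```python
import math

def worst_case_3d_error(span_m, n_bins):
    """Upper bound on Euclidean error from uniform per-axis quantization.

    span_m: assumed workspace extent per axis in metres; n_bins: bins per
    axis. Both are illustrative placeholders, not the paper's values.
    """
    w = span_m / n_bins          # bin width in metres
    return math.sqrt(3) * w / 2  # corner-of-voxel worst case

# With a hypothetical 4 m span, how fine must the grid be for mm-level error?
for n in (256, 1024, 4096):
    print(n, "bins ->", round(worst_case_3d_error(4.0, n) * 1000, 2), "mm")
```

Under these assumptions, roughly a thousand bins per axis already bounds the worst-case error at a few millimetres, so the premise is not geometrically implausible; the open question is whether the model's token predictions actually land in the right bins.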
What would settle it
A side-by-side test on egocentric HRI video datasets that measures average 3D position error for task-critical keypoints and finds no improvement, or outright worse results, compared with standard full-body pose estimators.
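Such a test would hinge on MPJPE (mean per-joint position error) restricted to task-critical joints rather than averaged over the whole body. A minimal sketch, with illustrative array shapes and joint indices:

```python
import numpy as np

def mpjpe(pred, gt, joint_mask=None):
    """Mean per-joint position error (metres) over selected joints.

    pred, gt: (N, J, 3) keypoint arrays in the camera frame. joint_mask
    optionally restricts the average to task-critical joints (e.g. wrists)
    rather than the whole body -- the distinction this test turns on.
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, J) per-joint distances
    if joint_mask is not None:
        err = err[:, joint_mask]
    return float(err.mean())

gt = np.zeros((1, 4, 3))
pred = gt + np.array([0.01, 0.0, 0.0])  # uniform 1 cm offset on every joint
assert abs(mpjpe(pred, gt) - 0.01) < 1e-9
assert abs(mpjpe(pred, gt, joint_mask=[0, 1]) - 0.01) < 1e-9
```

A decisive comparison would report both the masked and unmasked variants for TAIHRI and for standard full-body estimators on the same egocentric data.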
read the original abstract
Accurate 3D human keypoints localization is a critical technology enabling robots to achieve natural and safe physical interaction with users. Conventional 3D human keypoints estimation methods primarily focus on the whole-body reconstruction quality relative to the root joint. However, in practical human-robot interaction (HRI) scenarios, robots are more concerned with the precise metric-scale spatial localization of task-relevant body parts under the egocentric camera 3D coordinate. We propose TAIHRI, the first Vision-Language Model (VLM) tailored for close-range HRI perception, capable of understanding users' motion commands and directing the robot's attention to the most task-relevant keypoints. By quantizing 3D keypoints into a finite interaction space, TAIHRI precisely localizes the 3D spatial coordinates of critical body parts by 2D keypoint reasoning via next token prediction, and seamlessly adapts to downstream tasks such as natural language control or global space human mesh recovery. Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts. We believe TAIHRI opens new research avenues in the field of embodied human-robot interaction. Code is available at: https://github.com/Tencent/TAIHRI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TAIHRI, the first VLM tailored for close-range HRI perception. It quantizes 3D keypoints into a finite interaction space so that 2D keypoint reasoning can be performed via next-token prediction, recovering precise metric-scale 3D coordinates of task-critical body parts under egocentric views. The method is claimed to support downstream tasks such as natural language control and global human mesh recovery, with experiments on egocentric interaction benchmarks demonstrating superior estimation accuracy for task-critical parts.
Significance. If the quantitative claims hold, the work could meaningfully advance embodied HRI by shifting emphasis from whole-body reconstruction to task-aware, metric-precise localization of relevant keypoints. Code availability at the cited GitHub repository is a clear strength that aids reproducibility. However, the significance is tempered by the absence of any reported error metrics, baseline comparisons, or resolution details that would confirm the discretization step preserves the required precision.
major comments (2)
- [Abstract] Abstract: the central claim that 'Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts' is stated without any numerical results, tables, MPJPE values, or baseline comparisons. This absence directly undermines evaluation of the primary contribution.
- [Method] Method (quantization and next-token prediction paragraph): quantizing 3D keypoints into a finite interaction space and recovering coordinates via categorical VLM token prediction inherently replaces continuous regression with discrete classification. No bin resolution, quantization error bounds, or 2D-to-3D lifting analysis is provided, leaving open whether millimeter-scale accuracy can be maintained in close-range egocentric HRI where small discretization artifacts dominate.
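For context on the lifting analysis the referee asks for, standard pinhole back-projection shows how 2D pixel error propagates into metric 3D error as depth grows. The intrinsics below are illustrative, and the paper's actual in-model 2D-to-3D mechanism is not specified in the abstract.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Standard pinhole lifting of a 2D pixel plus depth to camera-frame 3D.

    (fx, fy, cx, cy) are illustrative intrinsics -- a sketch of the classical
    geometry, not the paper's learned lifting.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Error propagation: at depth z, a 1-pixel 2D error costs z/fx metres laterally,
# so close range (small z) is exactly where 2D reasoning is most forgiving.
fx = fy = 500.0
cx = cy = 320.0
x0, y0, _ = backproject(400, 320, 0.8, fx, fy, cx, cy)
x1, y1, _ = backproject(401, 320, 0.8, fx, fy, cx, cy)
assert abs((x1 - x0) - 0.8 / fx) < 1e-12  # 1 px -> 1.6 mm at 0.8 m
```

A full analysis in the revised paper would need to compose this geometric term with the quantization error of the finite interaction space.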
minor comments (1)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., error reduction on task-critical joints) to allow readers to immediately gauge the claimed improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of quantitative results and methodological details. We address each point below and have revised the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'Experiments on egocentric interaction benchmarks demonstrate that TAIHRI achieves superior estimation accuracy for task-critical body parts' is stated without any numerical results, tables, MPJPE values, or baseline comparisons. This absence directly undermines evaluation of the primary contribution.
Authors: We agree that the abstract claim would be more compelling with explicit quantitative support. The revised manuscript updates the abstract to include representative MPJPE values for task-critical keypoints along with direct comparisons to the baselines evaluated in the experiments section. revision: yes
-
Referee: [Method] Method (quantization and next-token prediction paragraph): quantizing 3D keypoints into a finite interaction space and recovering coordinates via categorical VLM token prediction inherently replaces continuous regression with discrete classification. No bin resolution, quantization error bounds, or 2D-to-3D lifting analysis is provided, leaving open whether millimeter-scale accuracy can be maintained in close-range egocentric HRI where small discretization artifacts dominate.
Authors: We acknowledge the need for explicit characterization of the discretization. The revised method section now includes a dedicated paragraph specifying the bin resolution of the finite interaction space, deriving quantization error bounds, and presenting a 2D-to-3D lifting error propagation analysis. These additions confirm that the chosen discretization preserves the millimeter-scale precision required for close-range egocentric HRI on task-critical keypoints. revision: yes
Circularity Check
No circularity; TAIHRI is an independent VLM application validated on external benchmarks
full rationale
The paper's derivation consists of proposing a quantization of 3D keypoints into a finite interaction space followed by VLM-based 2D keypoint reasoning via next-token prediction, with the central claim of superior task-critical accuracy supported by experiments on egocentric interaction benchmarks. No equations or steps reduce by construction to the inputs (no self-definitional loops, no fitted parameters renamed as predictions, no load-bearing self-citations, and no ansatz smuggled via prior work). The approach is presented as a new application rather than a closed derivation, making the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Vision-language models can effectively map 2D keypoint reasoning to accurate 3D metric coordinates via next-token prediction after quantization into a finite space.
Reference graph
Works this paper leans on
- [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
- [2] Baradel, F., Armando, M., Galaaoui, S., Brégier, R., Weinzaepfel, P., Rogez, G., Lucas, T.: Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. In: European Conference on Computer Vision (2024)
- [3] Black, M.J., Patel, P., Tesch, J., Yang, J.: BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8726–8737 (2023)
- [4]
- [5] Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852 (2021)
- [6] Choi, H., Moon, G., Park, J., Lee, K.M.: Learning to estimate robust 3D human mesh from in-the-wild crowded scenes. In: CVPR (2022)
- [7] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
- [8] Feng, D., Guo, P., Peng, E., Zhu, M., Yu, W., Wang, P.: PoseLLaVA: Pose centric multimodal LLM for fine-grained 3D pose manipulation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2951–2959 (2025)
- [9] Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: ChatPose: Chatting about 3D human pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
- [10] Ge, Y., Wang, W., Chen, Y., Chen, H., Shen, C.: 3D human reconstruction in the wild with synthetic data using generative models. arXiv preprint arXiv:2403.11111 (2024)
- [11] Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Reconstructing and tracking humans with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
- [12] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), 1325–1339 (2013)
- [13] Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. arXiv preprint arXiv:2510.12798 (2025)
- [14] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7122–7131 (2018)
- [15] Khirodkar, R., Song, J.T., Cao, J., Luo, Z., Kitani, K.: Harmony4D: A video dataset for in-the-wild close human interactions. Advances in Neural Information Processing Systems 37, 107270–107285 (2024)
- [16] Khirodkar, R., Tripathi, S., Kitani, K.: Occluded human mesh recovery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1715–1725 (2022)
- [17] Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: PARE: Part attention regressor for 3D human body estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11127–11137 (2021)
- [18] Li, A., Liu, J., Zhu, Y., Tang, Y.: ScoreHOI: Physically plausible reconstruction of human-object interaction via score-guided diffusion. arXiv preprint arXiv:2509.07920 (2025)
- [19] Li, J., Yang, Z., Wang, X., Ma, J., Zhou, C., Yang, Y.: JOTR: 3D joint contrastive learning with transformers for occluded human mesh recovery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9110–9121 (2023)
- [20] Li, R., Li, S., Kong, L., Yang, X., Liang, J.: SeeGround: See and ground for zero-shot open-vocabulary 3D visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3707–3717 (2025)
- [21] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: CLIFF: Carrying location information in full frames into human pose and shape estimation. In: European Conference on Computer Vision. pp. 590–606. Springer (2022)
- [22] Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
- [23] Lin, J., Feng, Y., Liu, W., Black, M.J.: ChatHuman: Chatting about 3D humans with tools. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8150–8161 (2025)
- [24] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM TOG 34(6), 1–16 (2015)
- [25] Man, Y., et al.: LocateAnything3D: Vision-language 3D detection with chain-of-sight. arXiv preprint arXiv:25xx.xxxxx (2025)
- [26] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 3D Vision (3DV), 2017 Fifth International Conference on. IEEE (2017)
- [27] Meyer, G.P.: An alternative probabilistic interpretation of the Huber loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5261–5269 (2021)
- [28] OpenAI: GPT-4V system card. https://openai.com/index/gpt-4v-system-card/ (2025), accessed: 2026-01-24
- [29] OpenAI: Introducing GPT-5. https://openai.com/index/introducing-gpt-5/ (2025), accessed: 2026-01-24
- [30] Patel, P., Black, M.J.: CameraHMR: Aligning people with perspective. In: International Conference on 3D Vision (3DV) (2025)
- [31] Patel, P., Huang, C.H.P., Tesch, J., Hoffmann, D.T., Tripathi, S., Black, M.J.: AGORA: Avatars in geography optimized for regression analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13468–13478 (2021)
- [32] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10975–10985 (2019)
- [33] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11488–11499 (2021)
- [34] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
- [35] Su, C., Ma, X., Su, J., Wang, Y.: SAT-HMR: Real-time multi-person 3D mesh estimation via scale-adaptive tokens. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16796–16806 (2025)
- [36] Tesch, J., Becherini, G., Achar, P., Yiannakidis, A., Kocabas, M., Patel, P., Black, M.J.: BEDLAM2.0: Synthetic humans and cameras in motion (2025)
- [37] Wang, W., Ge, Y., Mei, H., Cai, Z., Sun, Q., Wang, Y., Shen, C., Yang, L., Komura, T.: Zolly: Zoom focal length correctly for perspective-distorted human mesh reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3925–3935 (2023)
- [38] Wang, Y., Sun, Y., Patel, P., Daniilidis, K., Black, M.J., Kocabas, M.: PromptHMR: Promptable human mesh recovery. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1148–1159 (2025)
- [39] Wang, Y., Liu, W., Niu, J., Zhang, H., Tang, Y.: VG-Refiner: Towards tool-refined referring grounded reasoning via agentic reinforcement learning. arXiv preprint arXiv:2512.06373 (2025)
- [40] Wang, Y., Xu, H., Liu, Y., Li, J., Tang, Y.: SAM2-LOVE: Segment anything model 2 in language-aided audio-visual scenes. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28932–28941 (2025)
- [41] Wang, Y., Ke, L., Zhang, B., Qu, T., Yu, H., Huang, Z., Yu, M., Xu, D., Yu, D.: N3D-VLM: Native 3D grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561 (2025)
- [42] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [43] Xie, X., Bhatnagar, B.L., Pons-Moll, G.: CHORE: Contact, human and object reconstruction from a single RGB image. In: ECCV (2022)
- [44] Xie, X., Bhatnagar, B.L., Pons-Moll, G.: Visibility aware human-object interaction tracking from single RGB camera. In: CVPR (2023)
- [45] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: ViTPose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems 35, 38571–38584 (2022)
- [46] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [47] Yang, X., Kukreja, D., Pinkus, D., Sagar, A., Fan, T., Park, J., Shin, S., Cao, J., Liu, J., Ugrinovic, N., Feiszli, M., Malik, J., Dollar, P., Kitani, K.: SAM 3D Body: Robust full-body human mesh recovery. arXiv preprint; identifier to be added (2025)
- [48] Ye, V., Pavlakos, G., Malik, J., Kanazawa, A.: Decoupling human and camera motion from videos in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21222–21232 (2023)
- [49] Zhang, S., Ma, Q., Zhang, Y., Qian, Z., Kwon, T., Pollefeys, M., Bogo, F., Tang, S.: EgoBody: Human body shape and motion of interacting people from head-mounted devices. In: European Conference on Computer Vision. pp. 180–200. Springer (2022)
- [50] Zhu, Y., Li, A., Tang, Y., Zhao, W., Zhou, J., Lu, J.: DPMesh: Exploiting diffusion prior for occluded human mesh recovery. In: CVPR (2024)