Recognition: 2 Lean theorem links
HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation
Pith reviewed 2026-05-14 18:01 UTC · model grok-4.3
The pith
HCSG lets robots navigate dynamic spaces by forecasting human motion and interpreting human intentions through combined geometric reasoning and vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HCSG provides a human-centric framework for VLN by introducing a unified Human Understanding Module that synergizes geometric forecasting of human pose and trajectory with semantic interpretation via a VLM to generate natural language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, supported by a social distance loss, resulting in improved performance on the HA-VLNCE benchmark.
What carries the argument
The unified Human Understanding Module that combines geometric forecasting of poses and trajectories with VLM-generated semantic descriptions of intentions, fused into a topological map.
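The paper does not spell out the fusion operator, so the following is only a minimal sketch of how human-centric features might be attached to a topological map node; `MapNode`, `fuse_human_context`, and all field names are hypothetical, and plain concatenation is an assumption (attention or learned weighting are equally plausible).

```python
from dataclasses import dataclass, field

@dataclass
class MapNode:
    """One node of the agent's topological map (hypothetical structure)."""
    node_id: str
    visual_feat: list                       # appearance features already on the node
    human_feats: list = field(default_factory=list)

def fuse_human_context(node, pose_forecast, intention_embedding):
    """Attach geometric and semantic human features to a map node.

    Concatenation is an assumption: the paper leaves the fusion
    operator unspecified (attention and learned weighting are alternatives).
    """
    node.human_feats = list(pose_forecast) + list(intention_embedding)
    return node

node = fuse_human_context(
    MapNode("n7", visual_feat=[0.1, 0.4]),
    pose_forecast=[1.2, 0.8],             # e.g. predicted displacement
    intention_embedding=[0.3, 0.9, 0.5],  # e.g. embedded VLM description
)
print(len(node.human_feats))  # → 5
```

Downstream, an instruction-conditioned planner would read `human_feats` alongside `visual_feat` when scoring candidate nodes.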
Load-bearing premise
The unified module can reliably predict accurate human poses, trajectories, and intention descriptions that improve planning in unseen real-world dynamic scenes.
What would settle it
Running the system in a controlled indoor environment with unpredictable pedestrian movements, in scenes the model has not seen, and measuring whether the success rate stays above baseline levels and the collision rate remains reduced.
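The two headline metrics are simple episode-level rates; a minimal sketch of how they could be computed from logged episodes, with the log field names (`success`, `collisions`) being hypothetical:

```python
def success_rate(episodes):
    """Fraction of episodes the agent completed successfully."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def collision_rate(episodes):
    """Fraction of episodes with at least one human collision."""
    return sum(1 for e in episodes if e["collisions"] > 0) / len(episodes)

log = [
    {"success": True,  "collisions": 0},
    {"success": False, "collisions": 2},
    {"success": True,  "collisions": 0},
    {"success": True,  "collisions": 1},
]
print(success_rate(log), collision_rate(log))  # → 0.75 0.5
```

Note the benchmark's exact success criterion (e.g. stopping distance to goal) is defined by HA-VLNCE, not by this sketch.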
Original abstract
VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HCSG, the first human-centric framework for Vision-Language Navigation (VLN) in dynamic indoor environments. It introduces a unified Human Understanding Module that combines geometric forecasting of human poses and trajectories with semantic interpretation via a Vision-Language Model (VLM) to generate natural-language descriptions of human actions and intentions. These representations are fused into the agent's topological map for instruction-conditioned planning, with an added social distance loss to enforce compliant interaction distances. Experiments on the HA-VLNCE benchmark report a 14% improvement in Success Rate and 34% reduction in Collision Rate over state-of-the-art methods.
Significance. If the central claims hold after detailed verification, this work would be significant for shifting VLN from static-environment assumptions and passive obstacle avoidance toward explicit, active human behavior understanding. It addresses a practical gap in real-world robotics by integrating semantic and geometric cues, potentially improving safety and social compliance in dynamic scenes. The reported benchmark gains on HA-VLNCE suggest tangible advances, though their attribution to the claimed synergy remains to be substantiated.
major comments (4)
- [Abstract / Methods] The fusion of semantic-geometric representations into the topological map is described only at a high level ("these semantic-geometric representations are fused"), with no equations, pseudocode, or architectural diagram specifying the integration operation (e.g., concatenation, attention, or learned weighting). This mechanism is load-bearing for the central claim that the synergy enables superior instruction-conditioned planning.
- [Results] No ablation studies are reported that isolate the contribution of the fusion step versus the social distance loss or the underlying VLN backbone. Without such controls, the 14% SR and 34% CR gains cannot be confidently attributed to the proposed Human Understanding Module rather than to stronger base components.
- [Methods] The paper provides no details on the internal architecture of the unified Human Understanding Module, including how geometric forecasts (pose/trajectory prediction) are combined with VLM-generated descriptions, nor any validation metrics for prediction accuracy in unseen dynamic scenes.
- [Experiments] The abstract reports benchmark gains but omits baseline details, error bars, statistical significance tests, and the exact interaction between the social distance loss and the fused representations, preventing verification of the soundness of the 14% and 34% improvements.
minor comments (1)
- [Abstract] The project website link is provided, but the manuscript does not indicate whether code, models, or the HA-VLNCE benchmark splits will be released to support reproducibility.
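The Methods comment above asks how good the geometric forecasts actually are. A natural reference point any learned forecaster would need to beat is constant-velocity extrapolation of the tracked 2-D trajectory; the sketch below is that baseline, not the paper's model, and `constant_velocity_forecast` is a hypothetical name.

```python
def constant_velocity_forecast(track, horizon=3):
    """Extrapolate a 2-D trajectory assuming the last observed velocity persists.

    track: list of (x, y) positions at uniform time steps, newest last.
    Returns the next `horizon` predicted positions.
    """
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = x1 - x0, y1 - y0  # per-step displacement
    return [(x1 + vx * (k + 1), y1 + vy * (k + 1)) for k in range(horizon)]

path = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
print(constant_velocity_forecast(path))  # → [(1.5, 0.0), (2.0, 0.0), (2.5, 0.0)]
```

Reporting displacement error of the learned forecaster against this baseline in unseen scenes would directly address the missing validation metrics.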
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing VLN topological maps can be extended with human pose/trajectory forecasts and VLM-generated language descriptions without breaking instruction-conditioned planning.
invented entities (1)
- Human Understanding Module (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Cited passage: "semantic-geometric representations are fused into the agent's topological map ... Social Distance Loss ... L_total = L_pose + L_traj + L_coll + L_prox + L_nav (Eq. 13)"
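The objective quoted above sums five terms, with L_prox enforcing social distance. The proximity term can be sketched as a hinge penalty on proxemic violations; the comfort radius (~1.2 m, a common proxemics value) and the quadratic form are assumptions, not taken from the paper.

```python
import math

def social_distance_loss(agent_pos, human_positions, d_comfort=1.2):
    """Quadratic hinge penalty for entering a human's comfort radius.

    Zero when every human is at least d_comfort away; grows quadratically
    as the agent intrudes. The paper's exact threshold and penalty form
    are not specified, so this is a hedged sketch.
    """
    loss = 0.0
    for hx, hy in human_positions:
        d = math.hypot(agent_pos[0] - hx, agent_pos[1] - hy)
        loss += max(0.0, d_comfort - d) ** 2
    return loss

def total_loss(l_pose, l_traj, l_coll, l_prox, l_nav):
    """Unweighted sum as in the quoted Eq. 13; any term weights are unstated."""
    return l_pose + l_traj + l_coll + l_prox + l_nav

# One human well inside the comfort radius, one safely outside.
print(round(social_distance_loss((0.0, 0.0), [(0.5, 0.0), (3.0, 0.0)]), 2))  # → 0.49
```

Only the human at 0.5 m contributes, with penalty (1.2 - 0.5)² = 0.49; the human at 3 m adds nothing.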
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.