Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3
The pith
Embodied agents can finish tasks in the world yet fail to correctly report termination, producing gaps of up to 19.7 percentage points between world-state completion and benchmark success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that terminal commitment, defined as correctly ending an episode with a verified semantic success report, is distinct from world-state completion and can be measured independently. Under VIGIL, agents receive only visual input and produce reports checked deterministically against hidden states, yielding separate W and B scores that expose up to 19.7 percentage point differences across models with comparable execution, plus persistent commitment failures even after action-feedback interventions.
What carries the argument
VIGIL evaluation protocol, which computes world-state completion (W) separately from benchmark success (B) by requiring a semantic terminal report at episode close that is verified deterministically against hidden world state using only egocentric RGB observations.
If this is right
- Models with matched world completion can still differ substantially in benchmark success because of post-attainment drift or unsupported terminal reports.
- Action feedback improves world-state completion across models but leaves terminal commitment failures intact in those that do not already ground reports in the achieved state.
- Four distinct outcome categories become measurable: missed execution, post-attainment drift, unsupported commitment, and verified success.
- Some models convert achieved states into correct reports while others with near-identical execution fail to close episodes properly.
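The four categories listed above admit a simple decision tree. The sketch below is one plausible mapping, with hypothetical boolean inputs (`attained`, `held_at_close`, `reported_success`) standing in for VIGIL's richer episode records:

```python
from enum import Enum

class Outcome(Enum):
    MISSED_EXECUTION = "missed execution"            # goal state never attained
    POST_ATTAINMENT_DRIFT = "post-attainment drift"  # attained, then lost before closing
    UNSUPPORTED_COMMITMENT = "unsupported commitment"  # success claimed without a verifying state
    VERIFIED_SUCCESS = "verified success"            # state attained and report checks out

def classify(attained: bool, held_at_close: bool, reported_success: bool) -> Outcome:
    """One plausible decision tree over episode outcomes (not the paper's rules)."""
    if held_at_close and reported_success:
        return Outcome.VERIFIED_SUCCESS
    if reported_success:   # claim of success that the hidden state does not back
        return Outcome.UNSUPPORTED_COMMITMENT
    if attained:           # goal was reached but the episode never closed on it
        return Outcome.POST_ATTAINMENT_DRIFT
    return Outcome.MISSED_EXECUTION
```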
Where Pith is reading between the lines
- Training procedures may need explicit objectives for accurate self-termination detection rather than relying solely on task execution signals.
- The separation could matter for real-world robot deployment where incorrect stopping decisions carry safety costs.
- Benchmark designers in other agent domains might adopt similar hidden-state verification to isolate execution from recognition of completion.
Load-bearing premise
A semantic terminal report can be checked deterministically against hidden world state independently of action success signals or other external cues.
What would settle it
If models that achieve identical world states on the same frozen episodes also produced identical rates of correct terminal reports, terminal reporting would be fully determined by execution, and the claimed independence between world completion and self-reported termination would not hold.
read the original abstract
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VIGIL, an evaluation framework for embodied agents that separates world-state completion (W) from benchmark success (B), where B additionally requires a correct semantic terminal report verified deterministically against hidden world state. Using only egocentric RGB observations and no action-success signals, the protocol distinguishes four outcome categories (missed execution, post-attainment drift, unsupported commitment, verified success). Experiments across 20 models on 1,000 frozen episodes show models with comparable W differing by up to 19.7 pp in B, and an action-feedback intervention improves W broadly but leaves commitment failures in some models.
Significance. If the results hold, VIGIL offers a reproducible protocol for isolating terminal commitment, a capacity that standard embodied benchmarks conflate with execution success. The empirical separation of W and B, achieved via frozen episodes and deterministic checks, provides a concrete way to diagnose post-attainment drift and unsupported reports, with direct implications for agent reliability and safety. The intervention results further demonstrate that the measures are not redundant, supporting more targeted improvements in embodied systems.
major comments (2)
- [Abstract] The central 19.7 pp gap in B between models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., a W range or threshold), no model identifiers, and no statistical detail on the gap, preventing verification of robustness from the information given.
- [Evaluation protocol] The deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether implicit environmental cues (e.g., visual changes at termination) could still leak success information into report generation.
minor comments (2)
- [Abstract] The four outcome categories are listed but not illustrated; one short concrete example per category would improve immediate readability.
- The manuscript should include a table or figure summarizing the exact W and B scores for the 20 models to allow readers to assess the 'comparable W' claim directly.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below, providing clarifications drawn from the full manuscript and committing to targeted revisions that improve self-containment and transparency.
read point-by-point responses
-
Referee: [Abstract] The central 19.7 pp gap in B between models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., a W range or threshold), no model identifiers, and no statistical detail on the gap, preventing verification of robustness from the information given.
Authors: We agree that the abstract would benefit from greater self-containment on this point. In the full manuscript (Section 4.2 and Table 2), 'comparable W' is defined as world-state completion scores differing by at most 5 percentage points; the cited 19.7 pp B gap occurs between two models with W scores of 72.4% and 71.9% (identifiers: 'VLM-7B' and 'VLM-13B' as labeled in the results), yielding B scores of 48.3% versus 28.6% (p < 0.001 via bootstrap resampling over the 1,000 frozen episodes, 95% CI [16.2, 23.1]). We will revise the abstract to include this explicit definition, the model identifiers, and a concise statistical note. revision: yes
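The bootstrap cited above (resampling the 1,000 frozen episodes to put a confidence interval on the B gap) is standard and easy to sketch. The function below is a generic paired bootstrap on per-episode 0/1 outcomes, not the authors' code, and any data fed to it here would be synthetic:

```python
import random

def bootstrap_gap_ci(success_a, success_b, n_boot=10000, alpha=0.05, seed=0):
    """Paired bootstrap over episodes: resample episode indices with
    replacement and recompute the success gap (in percentage points)
    each time; return the (alpha/2, 1 - alpha/2) percentile interval."""
    rng = random.Random(seed)
    n = len(success_a)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = (sum(success_a[i] for i in idx)
               - sum(success_b[i] for i in idx)) * 100.0 / n
        gaps.append(gap)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Called on two per-episode success vectors of length 1,000, it returns a percentile interval for the gap in percentage points; an interval excluding zero supports a real B gap between the two models.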
-
Referee: [Evaluation protocol] The deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether implicit environmental cues (e.g., visual changes at termination) could still leak success information into report generation.
Authors: The protocol provides agents with only egocentric RGB observations and explicitly withholds action-success signals. We acknowledge that task completion can produce observable visual changes in the RGB stream (e.g., object state transitions), which an agent's policy may use when deciding to terminate and generate its semantic report. This is perceptual evidence inherent to the embodied setting rather than an external success signal. The deterministic verification against hidden world state still enforces report accuracy independently of how the agent reached its termination decision. We will add a clarifying paragraph in the Evaluation Protocol section distinguishing these cues from prohibited action-success feedback and noting that VIGIL's W/B separation remains intact. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper introduces VIGIL as a procedurally defined evaluation protocol that separates W (world-state completion at termination) from B (benchmark success requiring a correct semantic terminal report). No equations, fitted parameters, or self-citations appear in the derivation chain. The four outcome categories and the reported gaps (up to 19.7 pp) follow directly from the explicit definitions: egocentric RGB input, absence of action-success signals, a forced terminal report, and a deterministic check against hidden state. The framework is defined independently of external benchmarks, and by construction its claims do not reduce to their own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: agents observe only egocentric RGB and receive no action-success signals.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "VIGIL yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report... agents observe only egocentric RGB, receive no action-success signals"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean : reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, et al. Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? In ICLR, 2026. See also arXiv:2602.07055.
- [2] Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, and Mengdi Wang. CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations. In ICLR, 2026.
- [3] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 2023.
- [4] Rui Yang, Hanyang Chen, Junyu Zhang, et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. In ICML, 2025. See also arXiv:2502.09560.
- [5] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR, 2020.
- [6] Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents. In ICLR, 2024.
- [7] Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation. In CVPR, 2024.
- [8] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. In NeurIPS, 2024.
- [9] Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, et al. EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents. arXiv:2501.11858, 2025.
- [10] Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, and Tong Xu. How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective. In AAAI, 2026. See also arXiv:2602.20687.
- [11] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Vu, et al. Language Models (Mostly) Know What They Know. arXiv:2207.05221, 2022.
- [12] Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. In CoRL, 2023.
- [13] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner Monologue: Embodied Reasoning through Planning with Language Models. In CoRL, 2022.
- [14] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR, 2021.
- [15] Kaizhi Zheng, Xiaotong Chen, Odest Jenkins, and Xin Eric Wang. VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation. In NeurIPS, 2022.
- [16] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martin-Martin, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In CoRL, 2023.
- [17] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-Driven Embodied Agents That Chat. In AAAI, 2022.
- [18] Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, and Jonghyun Choi. ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments. In ECCV, 2024.
- [19] Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. In CVPR, 2024.
- [20] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. In NeurIPS, 2024.
- [21] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. In ICLR, 2026. See also arXiv:2506.03135.
- [22] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. In ICLR, 2025.
- [23] Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models. In CVPR, 2025.
- [24] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics. In CoRL, 2024.
- [25] Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics. In CVPR, 2025.
- [26] Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards Spatial Supersensing in Video. arXiv:2511.04670, 2025.
- [27] Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. VLM4D: Towards Spatiotemporal Awareness in Vision Language Models. In ICCV, 2025.
- [28] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. In CVPR, 2024.
- [29] Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial Mental Modeling from Limited Views. In ICLR, 2025.
- [30] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv:1712.05474, 2017.
- [31] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In NeurIPS, 2022.
- [32] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026.
- [33] Dong Guo, Faming Wu, Feida Zhu, et al. Seed1.5-VL Technical Report. arXiv:2505.07062, 2025.
- [34] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026.
- [35] Anthropic. Introducing Claude 4. https://www.anthropic.com/news/claude-4, May 2025.
- [36] Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model. https://qwen.ai/blog?id=qwen3.6-27b, April 2026.
- [37] Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, February 2026.
- [38] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report. arXiv:2511.21631, 2025.
- [39] Weiyun Wang, Zhangwei Gao, Lixin Gu, Zhe Chen, et al. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. arXiv:2508.18265, 2025.
- [40] Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, et al. MiMo-Embodied: X-Embodied Foundation Model Technical Report. arXiv:2511.16518, 2025.
- [41] Kimi Team, Angang Du, et al. Kimi-VL Technical Report. arXiv:2504.07491, 2025.
- [42] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete. arXiv:2502.21257, 2025.
- [43] Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangping Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianfei Yang, Shijian Lu, and Deli Zhao. RynnBrain: Open Embodied Foundation Models. arXiv:2602.14979, 2026.
- [44] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025.