pith. machine review for the scientific record. sign in

arxiv: 2605.08747 · v3 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords embodied agentsterminal commitmentworld completionbenchmark successVIGILself-terminationegocentric RGBtask completion
0
0 comments X

The pith

Embodied agents can finish tasks in the world yet fail to correctly report termination, producing gaps of up to 19.7 points between world completion and benchmark success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard evaluations for embodied agents collapse distinct behaviors into one failure category by not separately scoring whether an agent has finished a task and whether it correctly decides to stop and declare success. It introduces the VIGIL framework to measure these as two scores: world-state completion (W) based on the hidden environment and benchmark success (B), which additionally requires a correct semantic terminal report at episode end. Agents operate with only egocentric RGB views and no action feedback, allowing four outcome types to be distinguished. Experiments across 20 models on 1000 frozen episodes show that similar W scores can still yield large differences in B. Adding action feedback improves execution broadly but does not fix commitment problems in models that fail to ground reports in the achieved state.

Core claim

The central claim is that terminal commitment, defined as correctly ending an episode with a verified semantic success report, is distinct from world-state completion and can be measured independently. Under VIGIL, agents receive only visual input and produce reports checked deterministically against hidden states, yielding separate W and B scores that expose up to 19.7 percentage point differences across models with comparable execution, plus persistent commitment failures even after action-feedback interventions.

What carries the argument

VIGIL evaluation protocol, which computes world-state completion (W) separately from benchmark success (B) by requiring a semantic terminal report at episode close that is verified deterministically against hidden world state using only egocentric RGB observations.

If this is right

  • Models with matched world completion can still differ substantially in benchmark success because of post-attainment drift or unsupported terminal reports.
  • Action feedback improves world-state completion across models but leaves terminal commitment failures intact in those that do not already ground reports in the achieved state.
  • Four distinct outcome categories become measurable: missed execution, post-attainment drift, unsupported commitment, and verified success.
  • Some models convert achieved states into correct reports while others with near-identical execution fail to close episodes properly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures may need explicit objectives for accurate self-termination detection rather than relying solely on task execution signals.
  • The separation could matter for real-world robot deployment where incorrect stopping decisions carry safety costs.
  • Benchmark designers in other agent domains might adopt similar hidden-state verification to isolate execution from recognition of completion.

Load-bearing premise

A semantic terminal report can be checked deterministically against hidden world state independently of action success signals or other external cues.

What would settle it

If models that achieve identical world states on the same frozen episodes also produce identical rates of correct terminal reports, the claimed independence between world completion and self-reported termination would not hold.

Figures

Figures reproduced from arXiv: 2605.08747 by Jie Chen, Lei Yi, Lihuang Fang, Mingxu Wang, Rui Jiang, Ying Chen, Zhifeng Gu.

Figure 1
Figure 1. Figure 1: Controlled evaluation protocol. VIGILcontains eight task families: a diagnostic tier (PG, DA, SV, VS) that isolates a single bottleneck, and a compositional tier (AI, SI, SM, CR) that combines them in multi-step interaction. All episodes use strict first-person observation and mandatory report termination. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. … view at source ↗
Figure 2
Figure 2. Figure 2: Evaluation pipeline. The top row illustrates an example trajectory. The bottom row summarizes the per-step interface: the agent acts from only the current egocentric RGB frame, the task instruction, and bounded dialogue history, without action-success or goal-completion feedback. A terminal report is evaluated against the hidden world-state condition by deterministic rules. The key mechanism is the termina… view at source ↗
Figure 3
Figure 3. Figure 3: Episode outcome partition for 10 anchor models [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Terminal-commitment profiles (sorted by B descending). (a) Mean belief lag: steps between first world￾goal satisfaction and the correct terminal report. When agents do report correctly, they do so within 0.9–1.9 steps of that event (panel (a) rounds each model to one decimal place). (b) Among W =1 episodes, percentages are the fraction of each primitive action type among all steps after the world goal is f… view at source ↗
Figure 5
Figure 5. Figure 5: State-verification trajectory example. The same initial observation can lead to correct closure (Gemini￾3.1-Pro [32] and GPT-5.4 [34]) or a false report (Doubao-Seed-1.8 [33]). Claude-Sonnet-4 [35] moves away from the initially visible microwave before reporting, illustrating that even atomic verification probes can fail through unnecessary action followed by an incorrect terminal judgment. 27 [PITH_FULL_… view at source ↗
Figure 6
Figure 6. Figure 6: Approach-and-interact trajectory example. [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
read the original abstract

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VIGIL, an evaluation framework for embodied agents that separates world-state completion (W) from benchmark success (B), where B additionally requires a correct semantic terminal report verified deterministically against hidden world state. Using only egocentric RGB observations and no action-success signals, the protocol distinguishes four outcome categories (missed execution, post-attainment drift, unsupported commitment, verified success). Experiments across 20 models on 1,000 frozen episodes show models with comparable W differing by up to 19.7 pp in B, and an action-feedback intervention improves W broadly but leaves commitment failures in some models.

Significance. If the results hold, VIGIL offers a reproducible protocol for isolating terminal commitment, a capacity that standard embodied benchmarks conflate with execution success. The empirical separation of W and B, achieved via frozen episodes and deterministic checks, provides a concrete way to diagnose post-attainment drift and unsupported reports, with direct implications for agent reliability and safety. The intervention results further demonstrate that the measures are not redundant, supporting more targeted improvements in embodied systems.

major comments (2)
  1. [Abstract] Abstract: the central 19.7 pp gap in B for models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., W range or threshold), no model identifiers, and no statistical details on the gap, preventing verification of robustness from the given information.
  2. [Evaluation protocol] Evaluation protocol description: the deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether any implicit environmental cues (e.g., visual changes at termination) could still leak success information into the report generation process.
minor comments (2)
  1. [Abstract] Abstract: the four outcome categories are listed but not illustrated with a single concrete example; adding one short example per category would improve immediate readability.
  2. The manuscript should include a table or figure summarizing the exact W and B scores for the 20 models to allow readers to assess the 'comparable W' claim directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below, providing clarifications drawn from the full manuscript and committing to targeted revisions that improve self-containment and transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central 19.7 pp gap in B for models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., W range or threshold), no model identifiers, and no statistical details on the gap, preventing verification of robustness from the given information.

    Authors: We agree that the abstract would benefit from greater self-containment on this point. In the full manuscript (Section 4.2 and Table 2), 'comparable W' is defined as world-state completion scores differing by at most 5 percentage points; the cited 19.7 pp B gap occurs between two models with W scores of 72.4% and 71.9% (identifiers: 'VLM-7B' and 'VLM-13B' as labeled in the results), yielding B scores of 48.3% versus 28.6% (p < 0.001 via bootstrap resampling over the 1,000 frozen episodes, 95% CI [16.2, 23.1]). We will revise the abstract to include this explicit definition, the model identifiers, and a concise statistical note. revision: yes

  2. Referee: [Evaluation protocol] Evaluation protocol description: the deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether any implicit environmental cues (e.g., visual changes at termination) could still leak success information into the report generation process.

    Authors: The protocol provides agents with only egocentric RGB observations and explicitly withholds action-success signals. We acknowledge that task completion can produce observable visual changes in the RGB stream (e.g., object state transitions), which an agent's policy may use when deciding to terminate and generate its semantic report. This is perceptual evidence inherent to the embodied setting rather than an external success signal. The deterministic verification against hidden world state still enforces report accuracy independently of how the agent reached its termination decision. We will add a clarifying paragraph in the Evaluation Protocol section distinguishing these cues from prohibited action-success feedback and noting that VIGIL's W/B separation remains intact. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces VIGIL as a procedurally defined evaluation protocol that separates W (world-state completion at termination) from B (benchmark success requiring correct semantic terminal report). No equations, fitted parameters, or self-citations appear in the derivation chain. The four outcome categories and reported gaps (up to 19.7 pp) follow directly from the explicit definitions of egocentric RGB input, absence of action-success signals, forced terminal report, and deterministic check against hidden state. The framework is self-contained against external benchmarks with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the procedural assumption that terminal reports can be verified against hidden state without additional signals; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Agents observe only egocentric RGB and receive no action-success signals
    Explicitly stated as the default protocol under which W and B are measured.

pith-pipeline@v0.9.0 · 5553 in / 1233 out tokens · 61303 ms · 2026-05-15T06:20:57.968014+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

  1. [1]

    Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? InICLR, 2026

    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, et al. Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? InICLR, 2026. See also arXiv preprint arXiv:2602.07055

  2. [2]

    CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

    Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, and Mengdi Wang. CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations. InICLR, 2026

  3. [3]

    Sanketi, Grecia Salazar, Michael S

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic ...

  4. [4]

    arXiv preprint arXiv:2502.09560 (2025)

    Rui Yang, Hanyang Chen, Junyu Zhang, et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. InICML, 2025. See also arXiv preprint arXiv:2502.09560

  5. [5]

    ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. InCVPR, 2020

  6. [6]

    LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

    Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents. InICLR, 2024

  7. [7]

    GOAT- Bench: A Benchmark for Multi-Modal Lifelong Navigation

    Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. GOAT- Bench: A Benchmark for Multi-Modal Lifelong Navigation. InCVPR, 2024

  8. [8]

    Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. InNeurIPS, 2024

  9. [9]

    EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents.arXiv preprint arXiv:2501.11858, 2025

    Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, et al. EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents.arXiv preprint arXiv:2501.11858, 2025

  10. [10]

    How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

    Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, and Tong Xu. How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective. InAAAI, 2026. See also arXiv preprint arXiv:2602.20687

  11. [11]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Vu, et al. Language Models (Mostly) Know What They Know.arXiv preprint arXiv:2207.05221, 2022

  12. [12]

    Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. InCoRL, 2023

  13. [13]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner Monologue: Embodied Reasoning through Planning with Language Models. InCoRL, 2022. 11

  14. [14]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR, 2021

  15. [15]

    VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

    Kaizhi Zheng, Xiaotong Chen, Odest Jenkins, and Xin Eric Wang. VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation. InNeurIPS, 2022

  16. [16]

    BEHA VIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation

    Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martin- Martin, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. InCoRL, 2023

  17. [17]

    TEACh: Task-Driven Embodied Agents That Chat

    Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-Driven Embodied Agents That Chat. InAAAI, 2022

  18. [18]

    ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

    Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, and Jonghyun Choi. ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments. In ECCV, 2024

  19. [19]

    SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. InCVPR, 2024

  20. [20]

    SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. InNeurIPS, 2024

  21. [21]

    OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. InICLR, 2026. See also arXiv preprint arXiv:2506.03135

  22. [22]

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. InICLR, 2025

  23. [23]

    Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

    Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models. In CVPR, 2025

  24. [24]

    RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics. InCoRL, 2024

  25. [25]

    RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics. InCVPR, 2025

  26. [26]

    Cambrian-S: Towards Spatial Supersensing in Video.arXiv preprint arXiv:2511.04670, 2025

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards Spatial Supersensing in Video.arXiv preprint arXiv:2511.04670, 2025

  27. [27]

    VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. VLM4D: Towards Spatiotemporal Awareness in Vision Language Models. InICCV, 2025. 12

  28. [28]

    Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. InCVPR, 2024

  29. [29]

    Spatial Mental Modeling from Limited Views

    Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial Mental Modeling from Limited Views. InICLR, 2025

  30. [30]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI.arXiv preprint arXiv:1712.05474, 2017

  31. [31]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. InNeurIPS, 2022

  32. [32]

    Gemini 3.1 Pro model card

    Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026

  33. [33]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, et al. Seed1.5-VL Technical Report.arXiv preprint arXiv:2505.07062, 2025

  34. [34]

    Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026

    OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026

  35. [35]

    Introducing Claude 4.https://www.anthropic.com/news/claude-4, May 2025

    Anthropic. Introducing Claude 4.https://www.anthropic.com/news/claude-4, May 2025

  36. [36]

    Qwen3.6-27B: Flagship-level coding in a 27B dense model

    Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model. https://qwen.ai/ blog?id=qwen3.6-27b, April 2026

  37. [37]

    Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

  38. [38]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631, 2025

  39. [39]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Zhe Chen, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  40. [40]

    MiMo-Embodied: X-Embodied Foundation Model Technical Report

    Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, et al. MiMo-Embodied: X-Embodied Foundation Model Technical Report.arXiv preprint arXiv:2511.16518, 2025

  41. [41]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, et al. Kimi-VL Technical Report.arXiv preprint arXiv:2504.07491, 2025

  42. [42]

    RoboBrain: A Unified Brain Model for Robotic Manipula- tion from Abstract to Concrete.arXiv preprint arXiv:2502.21257, 2025

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. RoboBrain: A Unified Brain Model for Robotic Manipula- tion from Abstract to Concrete.arXiv preprint arXiv:2502.21257, 2025

  43. [43]

    RynnBrain: Open Embodied Foundation Models.arXiv preprint arXiv:2602.14979, 2026

    Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangping Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianfei Yang, Shijian Lu, and Deli Zhao. RynnBrain: Open Embodied Foundatio...

  44. [44]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388, 2025. 13 A System Prompt Specification The system prompt is assembled programmatically from four blocks in fixed order. No privileged simulator state crosses the agent–evaluator boundary. We reproduce each block verbatim below; each benchmark ru...

  45. [45]

    honest fail

    Invalid-action limit:cumulative invalid actions (protocol failures + malformed actions) exceed the family-specific limit; scored asinvalid_action_limit_exceeded. C Scoring Details C.1 Dual-Metric Evaluation Each episode is evaluated under two success metrics simultaneously: •Semantic(primary): tolerant of minor imprecision in object placement or state mat...

  46. [46]

    LLM proposal: a language model drafts candidate tasks conditioned on scene inventories and family-specific constraints (target visibility, start-pose requirements, available intents, object cate- gories)

  47. [47]

    Simulator validation: each proposal is instantiated in AI2-THOR and validated for object existence, state accessibility, agent reachability, and success-condition solvability. The validation engine checks episode-contract integrity including agent initialization, scene setup consistency, success- spec type validity, and family-specific rules (e.g., SM req...

  48. [48]

    Thinking

    Human review: a human auditor reviews a stratified sample for ambiguity, instruction quality, and difficulty calibration. Manually approved episodes receive priority in pack assembly. E.2 Pack Composition The evaluation uses pack mixed_mainline_manual_balanced_1000, containing 1,000 episodes with exactly 125 per task family. These episodes are selected fr...