arxiv: 2605.08747 · v3 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Ying Chen , Rui Jiang , Lihuang Fang , Mingxu Wang , Zhifeng Gu , Lei Yi , Jie Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords embodied agentsterminal commitmentworld completionbenchmark successVIGILself-terminationegocentric RGBtask completion

0 comments

The pith

Embodied agents can finish tasks in the world yet fail to correctly report termination, producing gaps of up to 19.7 points between world completion and benchmark success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard evaluations for embodied agents collapse distinct behaviors into one failure category by not separately scoring whether an agent has finished a task and whether it correctly decides to stop and declare success. It introduces the VIGIL framework to measure these as two scores: world-state completion (W) based on the hidden environment and benchmark success (B), which additionally requires a correct semantic terminal report at episode end. Agents operate with only egocentric RGB views and no action feedback, allowing four outcome types to be distinguished. Experiments across 20 models on 1000 frozen episodes show that similar W scores can still yield large differences in B. Adding action feedback improves execution broadly but does not fix commitment problems in models that fail to ground reports in the achieved state.

Core claim

The central claim is that terminal commitment, defined as correctly ending an episode with a verified semantic success report, is distinct from world-state completion and can be measured independently. Under VIGIL, agents receive only visual input and produce reports checked deterministically against hidden states, yielding separate W and B scores that expose up to 19.7 percentage point differences across models with comparable execution, plus persistent commitment failures even after action-feedback interventions.

What carries the argument

VIGIL evaluation protocol, which computes world-state completion (W) separately from benchmark success (B) by requiring a semantic terminal report at episode close that is verified deterministically against hidden world state using only egocentric RGB observations.

If this is right

Models with matched world completion can still differ substantially in benchmark success because of post-attainment drift or unsupported terminal reports.
Action feedback improves world-state completion across models but leaves terminal commitment failures intact in those that do not already ground reports in the achieved state.
Four distinct outcome categories become measurable: missed execution, post-attainment drift, unsupported commitment, and verified success.
Some models convert achieved states into correct reports while others with near-identical execution fail to close episodes properly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training procedures may need explicit objectives for accurate self-termination detection rather than relying solely on task execution signals.
The separation could matter for real-world robot deployment where incorrect stopping decisions carry safety costs.
Benchmark designers in other agent domains might adopt similar hidden-state verification to isolate execution from recognition of completion.

Load-bearing premise

A semantic terminal report can be checked deterministically against hidden world state independently of action success signals or other external cues.

What would settle it

If models that achieve identical world states on the same frozen episodes also produce identical rates of correct terminal reports, the claimed independence between world completion and self-reported termination would not hold.

Figures

Figures reproduced from arXiv: 2605.08747 by Jie Chen, Lei Yi, Lihuang Fang, Mingxu Wang, Rui Jiang, Ying Chen, Zhifeng Gu.

**Figure 1.** Figure 1: Controlled evaluation protocol. VIGILcontains eight task families: a diagnostic tier (PG, DA, SV, VS) that isolates a single bottleneck, and a compositional tier (AI, SI, SM, CR) that combines them in multi-step interaction. All episodes use strict first-person observation and mandatory report termination. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. … view at source ↗

**Figure 2.** Figure 2: Evaluation pipeline. The top row illustrates an example trajectory. The bottom row summarizes the per-step interface: the agent acts from only the current egocentric RGB frame, the task instruction, and bounded dialogue history, without action-success or goal-completion feedback. A terminal report is evaluated against the hidden world-state condition by deterministic rules. The key mechanism is the termina… view at source ↗

**Figure 3.** Figure 3: Episode outcome partition for 10 anchor models [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Terminal-commitment profiles (sorted by B descending). (a) Mean belief lag: steps between first worldgoal satisfaction and the correct terminal report. When agents do report correctly, they do so within 0.9–1.9 steps of that event (panel (a) rounds each model to one decimal place). (b) Among W =1 episodes, percentages are the fraction of each primitive action type among all steps after the world goal is f… view at source ↗

**Figure 5.** Figure 5: State-verification trajectory example. The same initial observation can lead to correct closure (Gemini3.1-Pro [32] and GPT-5.4 [34]) or a false report (Doubao-Seed-1.8 [33]). Claude-Sonnet-4 [35] moves away from the initially visible microwave before reporting, illustrating that even atomic verification probes can fail through unnecessary action followed by an incorrect terminal judgment. 27 [PITH_FULL_… view at source ↗

**Figure 6.** Figure 6: Approach-and-interact trajectory example. [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗

read the original abstract

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VIGIL separates task completion from correct terminal reporting and shows real gaps up to 19.7 pp across models with matched execution.

read the letter

Hi, the main point is that embodied agents can reach the right world state yet still fail to report termination correctly, and this paper gives a protocol to score those two things apart. VIGIL defines W as world-state completion at episode end and B as benchmark success that also requires an accurate semantic terminal report. This splits failures into four categories: missed execution, post-attainment drift, unsupported commitment, and verified success. On 20 models across 1000 frozen episodes the work finds differences as large as 19.7 pp in B even when W is nearly identical. The action-feedback intervention shows that execution-oriented signals lift W broadly but leave commitment problems untouched in models that do not already ground reports in the achieved state. The setup uses only egocentric RGB, no action-success signals, and a deterministic check against hidden state, which keeps the distinction clean and non-circular. The results follow directly from the procedural definitions and the frozen-episode tests. Soft spots are modest. The deterministic semantic check works in the reported environments but could become ambiguous in noisier or more open-ended tasks where completion is less sharply defined. Frozen episodes give control but skip live adaptation dynamics that might appear in real deployment. The core separation still holds up from the numbers given. This is for researchers working on embodied benchmarks and training signals who want a clearer read on termination reliability. A reader focused on evaluation protocols would get direct value from the framework and the empirical demonstration. It deserves peer review because the distinction is practical and the evidence is straightforward.

Referee Report

2 major / 2 minor

Summary. The paper introduces VIGIL, an evaluation framework for embodied agents that separates world-state completion (W) from benchmark success (B), where B additionally requires a correct semantic terminal report verified deterministically against hidden world state. Using only egocentric RGB observations and no action-success signals, the protocol distinguishes four outcome categories (missed execution, post-attainment drift, unsupported commitment, verified success). Experiments across 20 models on 1,000 frozen episodes show models with comparable W differing by up to 19.7 pp in B, and an action-feedback intervention improves W broadly but leaves commitment failures in some models.

Significance. If the results hold, VIGIL offers a reproducible protocol for isolating terminal commitment, a capacity that standard embodied benchmarks conflate with execution success. The empirical separation of W and B, achieved via frozen episodes and deterministic checks, provides a concrete way to diagnose post-attainment drift and unsupported reports, with direct implications for agent reliability and safety. The intervention results further demonstrate that the measures are not redundant, supporting more targeted improvements in embodied systems.

major comments (2)

[Abstract] Abstract: the central 19.7 pp gap in B for models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., W range or threshold), no model identifiers, and no statistical details on the gap, preventing verification of robustness from the given information.
[Evaluation protocol] Evaluation protocol description: the deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether any implicit environmental cues (e.g., visual changes at termination) could still leak success information into the report generation process.

minor comments (2)

[Abstract] Abstract: the four outcome categories are listed but not illustrated with a single concrete example; adding one short example per category would improve immediate readability.
The manuscript should include a table or figure summarizing the exact W and B scores for the 20 models to allow readers to assess the 'comparable W' claim directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the two major comments point by point below, providing clarifications drawn from the full manuscript and committing to targeted revisions that improve self-containment and transparency.

read point-by-point responses

Referee: [Abstract] Abstract: the central 19.7 pp gap in B for models with comparable W is load-bearing for the decoupling claim, yet the text provides no explicit definition of 'comparable' (e.g., W range or threshold), no model identifiers, and no statistical details on the gap, preventing verification of robustness from the given information.

Authors: We agree that the abstract would benefit from greater self-containment on this point. In the full manuscript (Section 4.2 and Table 2), 'comparable W' is defined as world-state completion scores differing by at most 5 percentage points; the cited 19.7 pp B gap occurs between two models with W scores of 72.4% and 71.9% (identifiers: 'VLM-7B' and 'VLM-13B' as labeled in the results), yielding B scores of 48.3% versus 28.6% (p < 0.001 via bootstrap resampling over the 1,000 frozen episodes, 95% CI [16.2, 23.1]). We will revise the abstract to include this explicit definition, the model identifiers, and a concise statistical note. revision: yes
Referee: [Evaluation protocol] Evaluation protocol description: the deterministic check of the semantic terminal report against hidden state is presented as independent of action-success signals, but the manuscript should clarify whether any implicit environmental cues (e.g., visual changes at termination) could still leak success information into the report generation process.

Authors: The protocol provides agents with only egocentric RGB observations and explicitly withholds action-success signals. We acknowledge that task completion can produce observable visual changes in the RGB stream (e.g., object state transitions), which an agent's policy may use when deciding to terminate and generate its semantic report. This is perceptual evidence inherent to the embodied setting rather than an external success signal. The deterministic verification against hidden world state still enforces report accuracy independently of how the agent reached its termination decision. We will add a clarifying paragraph in the Evaluation Protocol section distinguishing these cues from prohibited action-success feedback and noting that VIGIL's W/B separation remains intact. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces VIGIL as a procedurally defined evaluation protocol that separates W (world-state completion at termination) from B (benchmark success requiring correct semantic terminal report). No equations, fitted parameters, or self-citations appear in the derivation chain. The four outcome categories and reported gaps (up to 19.7 pp) follow directly from the explicit definitions of egocentric RGB input, absence of action-success signals, forced terminal report, and deterministic check against hidden state. The framework is self-contained against external benchmarks with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the procedural assumption that terminal reports can be verified against hidden state without additional signals; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Agents observe only egocentric RGB and receive no action-success signals
Explicitly stated as the default protocol under which W and B are measured.

pith-pipeline@v0.9.0 · 5553 in / 1233 out tokens · 61303 ms · 2026-05-15T06:20:57.968014+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

VIGIL yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report... agents observe only egocentric RGB, receive no action-success signals
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

[1]

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? InICLR, 2026

Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, et al. Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? InICLR, 2026. See also arXiv preprint arXiv:2602.07055

work page arXiv 2026
[2]

CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, and Mengdi Wang. CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations. InICLR, 2026

work page 2026
[3]

Sanketi, Grecia Salazar, Michael S

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic ...

work page 2023
[4]

arXiv preprint arXiv:2502.09560 (2025)

Rui Yang, Hanyang Chen, Junyu Zhang, et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. InICML, 2025. See also arXiv preprint arXiv:2502.09560

work page arXiv 2025
[5]

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. InCVPR, 2020

work page 2020
[6]

LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents. InICLR, 2024

work page 2024
[7]

GOAT- Bench: A Benchmark for Multi-Modal Lifelong Navigation

Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi. GOAT- Bench: A Benchmark for Multi-Modal Lifelong Navigation. InCVPR, 2024

work page 2024
[8]

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. InNeurIPS, 2024

work page 2024
[9]

EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents.arXiv preprint arXiv:2501.11858, 2025

Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, et al. EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents.arXiv preprint arXiv:2501.11858, 2025

work page arXiv 2025
[10]

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, and Tong Xu. How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective. InAAAI, 2026. See also arXiv preprint arXiv:2602.20687

work page arXiv 2026
[11]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Vu, et al. Language Models (Mostly) Know What They Know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. InCoRL, 2023

work page 2023
[13]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner Monologue: Embodied Reasoning through Planning with Language Models. InCoRL, 2022. 11

work page 2022
[14]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In ICLR, 2021

work page 2021
[15]

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Kaizhi Zheng, Xiaotong Chen, Odest Jenkins, and Xin Eric Wang. VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation. InNeurIPS, 2022

work page 2022
[16]

BEHA VIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martin- Martin, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHA VIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. InCoRL, 2023

work page 2023
[17]

TEACh: Task-Driven Embodied Agents That Chat

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. TEACh: Task-Driven Embodied Agents That Chat. InAAAI, 2022

work page 2022
[18]

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, and Jonghyun Choi. ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments. In ECCV, 2024

work page 2024
[19]

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. InCVPR, 2024

work page 2024
[20]

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. InNeurIPS, 2024

work page 2024
[21]

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models. InICLR, 2026. See also arXiv preprint arXiv:2506.03135

work page arXiv 2026
[22]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. InICLR, 2025

work page 2025
[23]

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models. In CVPR, 2025

work page 2025
[24]

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics. InCoRL, 2024

work page 2024
[25]

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics. InCVPR, 2025

work page 2025
[26]

Cambrian-S: Towards Spatial Supersensing in Video.arXiv preprint arXiv:2511.04670, 2025

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards Spatial Supersensing in Video.arXiv preprint arXiv:2511.04670, 2025

work page arXiv 2025
[27]

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. VLM4D: Towards Spatiotemporal Awareness in Vision Language Models. InICCV, 2025. 12

work page 2025
[28]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. InCVPR, 2024

work page 2024
[29]

Spatial Mental Modeling from Limited Views

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial Mental Modeling from Limited Views. InICLR, 2025

work page 2025
[30]

AI2-THOR: An Interactive 3D Environment for Visual AI

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI.arXiv preprint arXiv:1712.05474, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. InNeurIPS, 2022

work page 2022
[32]

Gemini 3.1 Pro model card

Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026

work page 2026
[33]

Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, et al. Seed1.5-VL Technical Report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026

OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, 2026

work page 2026
[35]

Introducing Claude 4.https://www.anthropic.com/news/claude-4, May 2025

Anthropic. Introducing Claude 4.https://www.anthropic.com/news/claude-4, May 2025

work page 2025
[36]

Qwen3.6-27B: Flagship-level coding in a 27B dense model

Qwen Team. Qwen3.6-27B: Flagship-level coding in a 27B dense model. https://qwen.ai/ blog?id=qwen3.6-27b, April 2026

work page 2026
[37]

Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents.https://qwen.ai/blog?id=qwen3.5, February 2026

work page 2026
[38]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Zhe Chen, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, et al. MiMo-Embodied: X-Embodied Foundation Model Technical Report.arXiv preprint arXiv:2511.16518, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Kimi-VL Technical Report

Kimi Team, Angang Du, et al. Kimi-VL Technical Report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

RoboBrain: A Unified Brain Model for Robotic Manipula- tion from Abstract to Concrete.arXiv preprint arXiv:2502.21257, 2025

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. RoboBrain: A Unified Brain Model for Robotic Manipula- tion from Abstract to Concrete.arXiv preprint arXiv:2502.21257, 2025

work page arXiv 2025
[43]

RynnBrain: Open Embodied Foundation Models.arXiv preprint arXiv:2602.14979, 2026

Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangping Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianfei Yang, Shijian Lu, and Deli Zhao. RynnBrain: Open Embodied Foundatio...

work page arXiv 2026
[44]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388, 2025. 13 A System Prompt Specification The system prompt is assembled programmatically from four blocks in fixed order. No privileged simulator state crosses the agent–evaluator boundary. We reproduce each block verbatim below; each benchmark ru...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

honest fail

Invalid-action limit:cumulative invalid actions (protocol failures + malformed actions) exceed the family-specific limit; scored asinvalid_action_limit_exceeded. C Scoring Details C.1 Dual-Metric Evaluation Each episode is evaluated under two success metrics simultaneously: •Semantic(primary): tolerant of minor imprecision in object placement or state mat...

work page
[46]

LLM proposal: a language model drafts candidate tasks conditioned on scene inventories and family-specific constraints (target visibility, start-pose requirements, available intents, object cate- gories)

work page
[47]

Simulator validation: each proposal is instantiated in AI2-THOR and validated for object existence, state accessibility, agent reachability, and success-condition solvability. The validation engine checks episode-contract integrity including agent initialization, scene setup consistency, success- spec type validity, and family-specific rules (e.g., SM req...

work page
[48]

Thinking

Human review: a human auditor reviews a stratified sample for ambiguity, instruction quality, and difficulty calibration. Manually approved episodes receive priority in pack assembly. E.2 Pack Composition The evaluation uses pack mixed_mainline_manual_balanced_1000, containing 1,000 episodes with exactly 125 per task family. These episodes are selected fr...

work page