ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

Haifei Liu; Kun Zhang; Quanming Yao; Shuhan Guo; Xingyu Gao; Yaqing Wang; Yongqi Zhang

arxiv: 2605.19314 · v1 · pith:P3DBUZ2Snew · submitted 2026-05-19 · 💻 cs.RO · cs.AI

Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao

Pith reviewed 2026-05-20 05:54 UTC · model grok-4.3

keywords: embodied agents · task-state alignment · long-horizon planning · hierarchical control · evidence packets · scoped updates · robot task execution

The pith

ContextFlow maintains coherent task frontiers in long-horizon embodied agents through explicit contracts and evidence-based scoped updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As specialist executors handle local tasks better, long-horizon embodied agents face misalignment between the planner's stage, runtime evidence, memory, and the delegated executor. This can cause bad handoffs, locked stages, or repeated replanning. ContextFlow addresses this by turning stages into contracts and observations into evidence packets that trigger targeted updates such as continue, refine, transfer, promote, or repair. A sympathetic reader would care because this makes the agent's overall task progress more reliable and easier to inspect without replacing the local controllers.

Core claim

ContextFlow is an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

What carries the argument

The scoped update mechanism driven by evidence packets, which diagnoses task-state misalignment and selects from continue, refine, transfer, promote, or repair actions to realign the task frontier.
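The mechanism above can be pictured with a small sketch. The abstract names the five updates but does not define their triggering conditions, so everything below is an assumption made for illustration: the field names, the signal strings (`mug_visible`, `target_moved`), the rule order, and the reading of each update (promote as advancing a satisfied stage, transfer as handing off after an executor failure, refine as adjusting the stage target, repair as fixing a stage mismatch). It is not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ScopedUpdate(Enum):
    CONTINUE = "continue"
    REFINE = "refine"
    TRANSFER = "transfer"
    PROMOTE = "promote"
    REPAIR = "repair"


@dataclass(frozen=True)
class StageContract:
    stage_id: str      # planner's active stage
    executor: str      # specialist executor the stage is delegated to
    done_signal: str   # hypothetical: signal that satisfies the contract


@dataclass(frozen=True)
class EvidencePacket:
    stage_id: str                     # stage the monitor attributes the evidence to
    signals: frozenset                # hypothetical symbolic observations
    executor_status: str = "running"  # hypothetical: "running" or "failed"


def select_update(contract: StageContract, packet: EvidencePacket) -> ScopedUpdate:
    """Pick one scoped update from a contract and a packet (assumed rule order)."""
    if packet.stage_id != contract.stage_id:
        return ScopedUpdate.REPAIR      # planner and evidence disagree on the stage
    if contract.done_signal in packet.signals:
        return ScopedUpdate.PROMOTE     # contract satisfied, advance the frontier
    if packet.executor_status == "failed":
        return ScopedUpdate.TRANSFER    # hand the stage to another executor
    if "target_moved" in packet.signals:
        return ScopedUpdate.REFINE      # keep the stage, adjust its target
    return ScopedUpdate.CONTINUE        # nothing contradicts the contract


contract = StageContract("find_mug", "search_executor", "mug_visible")
packet = EvidencePacket("find_mug", frozenset({"mug_visible"}))
decision = select_update(contract, packet)  # ScopedUpdate.PROMOTE under these rules
```

The point of the sketch is the scoping: each rule touches only the active stage, so a minor inconsistency never forces a full replan, and the returned value is a plain enum that can be logged and audited.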

If this is right

Task-state misalignment emerges as the dominant failure mode once local skills are reliable.
Explicit stage contracts enable auditable transitions between planning and execution layers.
Evidence packets allow targeted fixes that avoid full replanning on minor inconsistencies.
Alignment across planner, monitor, memory, and executor reduces unsupported handoffs and stage lock.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar evidence packets and updates could help software agents stay aligned when chaining many tool calls over long sessions.
Storing the generated evidence packets would create a traceable record useful for post-deployment debugging of agent decisions.
The contract and update rules might eventually be refined automatically from patterns in past misalignment cases.
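The traceable-record idea in the second extension can be sketched as an append-only JSON Lines log. The packet fields (`step`, `stage_id`, `update`) and both helper names are hypothetical; the paper does not describe a storage format, and an in-memory stream stands in for a file here.

```python
import io
import json


def append_packet(stream, packet: dict) -> None:
    """Write one evidence packet as a single JSON line."""
    stream.write(json.dumps(packet, sort_keys=True) + "\n")


def packets_for_stage(stream, stage_id: str) -> list:
    """Re-read the log from the start and keep packets for one stage."""
    stream.seek(0)
    return [p for p in map(json.loads, stream) if p["stage_id"] == stage_id]


log = io.StringIO()  # stands in for an on-disk log file
append_packet(log, {"step": 1, "stage_id": "find_mug", "update": "continue"})
append_packet(log, {"step": 2, "stage_id": "find_mug", "update": "promote"})
append_packet(log, {"step": 3, "stage_id": "grasp_mug", "update": "continue"})

history = packets_for_stage(log, "find_mug")
updates = [p["update"] for p in history]  # ["continue", "promote"]
```

One packet per line keeps the log easy to append during execution and easy to filter afterwards when debugging a specific stage.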

Load-bearing premise

The assumption that task-state misalignment is the main bottleneck after local skills improve and that scoped updates based on evidence packets will reliably diagnose and mitigate it across varied long-horizon tasks.

What would settle it

Running ContextFlow on a long-horizon task where evidence packets are correctly generated yet the agent still shows repeated stage lock or context mismatch would falsify the effectiveness of the scoped updates.
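Running such a test requires an operational definition of stage lock, which the abstract does not give. A minimal sketch, assuming the trace is the sequence of active stage ids recorded at each monitoring step and that a lock means one stage staying active for more than `max_steps` consecutive steps (both assumptions, as is the function name):

```python
from itertools import groupby


def locked_stages(stage_trace, max_steps):
    """Return stages whose consecutive run in the trace exceeds max_steps."""
    return [
        stage
        for stage, run in groupby(stage_trace)
        if len(list(run)) > max_steps
    ]


trace = ["nav", "nav", "search", "search", "search", "search", "search", "grasp"]
flagged = locked_stages(trace, max_steps=3)  # "search" runs for 5 steps
```

A threshold on run length is only a proxy: a long stage is not necessarily a locked one, so any real evaluation would need a per-stage budget or a progress signal.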

Figures

Figures reproduced from arXiv: 2605.19314 by Haifei Liu, Kun Zhang, Quanming Yao, Shuhan Guo, Xingyu Gao, Yaqing Wang, Yongqi Zhang.

**Figure 2.** Overview of ContextFlow as an inspectable alignment layer between high-level planning and grounded expert execution. Stage contracts expose planner commitments; memory records short-term subtask context and task-level history; expert executors perform local closed-loop execution; the asynchronous monitor converts observations and executor status into evidence packets; and the planner applies scoped updat…

**Figure 3.** Observable task-state failure cases used to construct the diagnostic split. Each panel pairs …

**Figure 4.** Representative runtime trace from the constructed split. The instruction is decomposed …

**Figure 5.** Auxiliary memory-guided closed-loop case. Memory provides retrieved context for the …

Original abstract

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.
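The abstract's definition of task-state misalignment can be read as a consistency check over four views of the task frontier. The sketch below assumes each component can be summarized by the stage it believes is active and the next-step decision its state supports; the `FrontierView` type, its fields, and the example stage names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontierView:
    source: str     # "planner", "monitor", "memory", or "executor"
    stage_id: str   # stage this component believes is active
    next_step: str  # next-step decision this component's state supports


def group_views(views):
    """Group component names by the (stage, next step) pair they support."""
    groups = defaultdict(list)
    for view in views:
        groups[(view.stage_id, view.next_step)].append(view.source)
    return dict(groups)


def is_aligned(views):
    """Aligned when every component supports the same stage and decision."""
    return len(group_views(views)) <= 1


views = [
    FrontierView("planner", "approach_table", "continue"),
    FrontierView("monitor", "approach_table", "continue"),
    FrontierView("memory", "approach_table", "continue"),
    FrontierView("executor", "search_room", "continue"),
]
aligned = is_aligned(views)    # False: the executor is on a different stage
groups = group_views(views)
```

Here the grouping isolates the executor as the odd one out, which corresponds to the executor-context mismatch the abstract lists among the failure cases.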

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ContextFlow gives a clean way to make task alignment explicit and auditable in long-horizon agents, but the support is limited to illustrative traces without quantitative tests or baselines.

ContextFlow treats task-state misalignment as the main remaining problem once local executors get stronger. The authors define it as drift between the planner's stage, runtime evidence, remembered context, and the delegated executor, then introduce explicit stage contracts, evidence packets from observations, and five scoped updates: continue, refine, transfer, promote, and repair. This keeps specialist modules responsible for their closed-loop work while exposing the alignment layer for inspection.

Referee Report

1 major / 1 minor

Summary. The paper proposes ContextFlow, a framework for addressing task-state misalignment in long-horizon embodied agents. As specialist executors improve, the bottleneck shifts to maintaining coherence between the planner's active stage, runtime evidence, remembered context, and delegated executor. ContextFlow represents stages as explicit contracts, converts observations into evidence packets, and applies scoped updates (continue, refine, transfer, promote, repair) to diagnose and correct inconsistencies while preserving local closed-loop control by executors. The approach is illustrated via demonstration traces and experiments on long-horizon embodied tasks.

Significance. If the central claims hold, ContextFlow offers a promising, inspectable method for explicit task-frontier alignment that separates global consistency from local skill execution. This could enhance reliability and auditability in complex embodied scenarios without requiring retraining of specialist modules. The framework's emphasis on evidence packets and scoped updates provides a concrete mechanism for mitigating failures like unsupported handoffs and stage lock, representing a useful conceptual contribution to hierarchical agent design.

major comments (1)

[Experiments] The manuscript supports the claim that scoped updates mitigate task-state misalignment primarily through illustrative demonstration traces on selected long-horizon scenarios, with no quantitative metrics, baseline comparisons, or ablations on the update rules or evidence extraction step. This is load-bearing for the central claim that task-state misalignment is the dominant bottleneck and that the alignment mechanism reliably diagnoses and corrects it across varied tasks.

minor comments (1)

[Abstract] The abstract refers to 'experiments and demonstration traces' without naming the specific tasks, environments, or any success criteria; adding this detail would clarify the scope of the evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential of ContextFlow as an inspectable framework for task-state alignment. We address the major comment below.

Referee: [Experiments] The manuscript supports the claim that scoped updates mitigate task-state misalignment primarily through illustrative demonstration traces on selected long-horizon scenarios, with no quantitative metrics, baseline comparisons, or ablations on the update rules or evidence extraction step. This is load-bearing for the central claim that task-state misalignment is the dominant bottleneck and that the alignment mechanism reliably diagnoses and corrects it across varied tasks.

Authors: We agree that the current manuscript relies primarily on illustrative demonstration traces rather than quantitative metrics, baseline comparisons, or ablations. These traces were selected to demonstrate the inspectability of the framework by showing concrete examples of how evidence packets and scoped updates (continue, refine, transfer, promote, repair) diagnose and resolve specific failures such as unsupported handoffs and stage lock. However, we recognize that this leaves the central claims about task-state misalignment as the dominant bottleneck and the reliability of the mechanism without broader empirical support. In the revised manuscript we will add quantitative evaluation, including success rates and misalignment resolution rates across multiple long-horizon embodied tasks, baseline comparisons against standard hierarchical planners without explicit alignment, and ablations isolating the contributions of the update rules and evidence extraction step. revision: yes
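The two metrics named in the response are not defined in the text. One plausible reading, sketched here under stated assumptions, takes success rate as the fraction of successful episodes and misalignment resolution rate as resolved misalignments over detected ones, pooled across episodes. The `Episode` record and the numbers are toy values for illustration only, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    success: bool   # did the long-horizon task finish successfully
    detected: int   # misalignments flagged during the episode
    resolved: int   # flagged misalignments that a scoped update fixed


def success_rate(episodes):
    """Fraction of episodes that succeeded."""
    return sum(e.success for e in episodes) / len(episodes)


def resolution_rate(episodes):
    """Resolved over detected misalignments, or None if none were detected."""
    detected = sum(e.detected for e in episodes)
    if detected == 0:
        return None
    return sum(e.resolved for e in episodes) / detected


# Toy values, not measurements.
episodes = [
    Episode(True, 2, 2),
    Episode(False, 3, 1),
    Episode(True, 0, 0),
    Episode(True, 0, 0),
]
sr = success_rate(episodes)     # 3 of 4 episodes succeed
rr = resolution_rate(episodes)  # 3 of 5 detected misalignments resolved
```

Pooling counts across episodes avoids dividing by zero on episodes with no detected misalignment, at the cost of weighting failure-heavy episodes more.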

Circularity Check

0 steps flagged

No circularity in framework proposal or alignment mechanism

full rationale

The manuscript defines task-state misalignment as a consistency failure across planner stage, evidence, context, and executor, then proposes ContextFlow as an explicit framework using contracts, evidence packets, and scoped updates (continue/refine/transfer/promote/repair). No equations, fitted parameters, or derivations appear that reduce any claimed result to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central contribution is presented as a design choice for inspectability and auditability, supported by illustrative traces rather than any self-referential reduction. This is a standard non-circular engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Information is limited to the abstract; no quantitative free parameters are specified. The core premise is a domain assumption about bottlenecks in embodied agents.

axioms (1)

domain assumption: Task-state misalignment is the primary bottleneck in long-horizon embodied agents as specialist executors become stronger.
Stated directly in the abstract as the shift from local execution to task-frontier consistency.

invented entities (1)

ContextFlow framework (no independent evidence)
purpose: To provide explicit contracts, evidence packets, and scoped updates for task-state alignment.
Newly introduced as the proposed solution in the abstract.
