REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Chenxi Jiang; Chuhao Zhou; Jianfei Yang

arXiv: 2505.10872 · v4 · submitted 2025-05-16 · cs.RO · cs.AI · cs.CL

Pith reviewed 2026-05-22 15:19 UTC · model grok-4.3

classification cs.RO · cs.AI · cs.CL

keywords robot task planning · referring expressions · vagueness · LLM planners · benchmark · context cognition · embodied agents · pragmatic theory

The pith

Vagueness in referring expressions causes up to 36.9 percent drops in success rates for LLM-based robot task planners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that vague referring expressions in human instructions substantially impair large language model based robot task planning. It introduces REI-Bench as the first benchmark to model such vagueness using pragmatic theory and measures the resulting performance degradation. The work finds that success rates fall by as much as 36.9 percent, with most failures caused by planners missing the intended objects. It then presents task-oriented context cognition, a method that converts vague inputs into explicit instructions and outperforms aware prompts, chain-of-thought reasoning, and in-context learning. The findings matter because non-expert users such as the elderly and children routinely produce these vague expressions, and resolving them is necessary for practical robot deployment in homes.

Core claim

The central claim is that vagueness arising from referring expressions in human instructions severely impairs robot task planning by large language models, with performance drops reaching 36.9 percent, mainly from failures to locate the referenced objects. REI-Bench systematically incorporates such vagueness based on pragmatic principles, and task-oriented context cognition resolves it by producing explicit instructions that achieve the best results among tested methods.

What carries the argument

Task-oriented context cognition, which generates clear instructions for robots by using task-oriented and environmental context to resolve vagueness in referring expressions.
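
To make the idea concrete, here is a minimal sketch of a two-stage pipeline in the spirit of task-oriented context cognition: first rewrite the vague instruction into an explicit one, then plan from the rewritten instruction. It is an illustration under stated assumptions, not the authors' implementation. The `LLM` callable type, the function names, and the prompt wording are all hypothetical; the page does not reproduce the prompts the paper uses for this step.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def clarify_instruction(llm: LLM, dialogue: str, instruction: str,
                        objects: List[str]) -> str:
    """Ask the model to restate a vague instruction with explicit object names.

    The prompt text below is invented for illustration only.
    """
    prompt = (
        "Dialogue so far:\n" + dialogue + "\n"
        "Objects in the environment: " + ", ".join(objects) + "\n"
        "Instruction: " + instruction + "\n"
        "Rewrite the instruction so that every referring expression "
        "names a concrete object."
    )
    return llm(prompt).strip()


def plan(llm: LLM, explicit_instruction: str) -> List[str]:
    """Ask the model for one action per line and split the reply into steps."""
    reply = llm("Decompose into robot actions, one per line:\n" + explicit_instruction)
    return [line.strip() for line in reply.splitlines() if line.strip()]


def plan_with_context_cognition(llm: LLM, dialogue: str, instruction: str,
                                objects: List[str]) -> Tuple[str, List[str]]:
    """Resolve vagueness first, then plan from the explicit instruction."""
    explicit = clarify_instruction(llm, dialogue, instruction, objects)
    return explicit, plan(llm, explicit)
```

Keeping the rewrite and the planning call separate means the explicit instruction can be inspected or logged before any action is generated, which is one way to check whether a failure came from reference resolution or from planning.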

If this is right

Robot task planners must incorporate mechanisms to handle vague referring expressions to reach reliable real-world performance.
Task-oriented context cognition provides a simple mitigation that surpasses aware prompts, chains of thought, and in-context learning.
Most planning failures trace to missing objects, so future systems should prioritize explicit object identification when context is ambiguous.
Benchmarks for embodied agents should include modeled vagueness to reflect the instructions actually produced by elderly and child users.
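
The claim that most failures trace to missing objects is easiest to check with a per-mode tally. The sketch below counts failure modes over a list of episode records. The record layout (`success`, `failure_mode`) and the mode labels are assumptions made for illustration; the page does not describe how the paper logs failures.

```python
from collections import Counter


def failure_breakdown(episodes):
    """Return (mode, count, share_of_failures) rows, most frequent first.

    Each episode is a dict with a boolean "success" key and, for failed
    episodes, a "failure_mode" label (a hypothetical schema).
    """
    failures = [e["failure_mode"] for e in episodes if not e["success"]]
    total = len(failures)
    if total == 0:
        return []
    counts = Counter(failures)
    return [(mode, n, n / total) for mode, n in counts.most_common()]
```

A table built from these rows would show directly what share of failures the missing-object mode accounts for.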

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Planners could integrate real-time scene perception to resolve referring expressions without separate clarification steps.
The benchmark could be extended to multi-turn conversations where vagueness builds across exchanges.
Similar vagueness problems likely affect other instruction-following domains such as robot navigation or manipulation commands.

Load-bearing premise

The benchmark's modeling of vague referring expressions, grounded in pragmatic theory, accurately captures the distribution and impact of real-world vagueness that non-expert users produce when giving instructions to robots.

What would settle it

Running task-oriented context cognition on REI-Bench and checking whether the observed success-rate drop of up to 36.9 percent is eliminated or greatly reduced compared with baseline planners that receive the original vague instructions.
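
The comparison described here reduces to simple arithmetic on success rates. The sketch below is a hedged illustration: this page does not say whether the reported 36.9 percent is an absolute difference in percentage points or a relative decrease, so the helper returns both. The function names are hypothetical.

```python
def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 100.0 * successes / trials


def rate_drop(baseline_rate: float, vague_rate: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = baseline_rate - vague_rate
    relative = 100.0 * absolute / baseline_rate if baseline_rate else float("nan")
    return absolute, relative
```

Running the same two functions on the mitigated planner's results and on the unmitigated planner's results shows how much of the drop the mitigation recovers.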

Figures

Figures reproduced from arXiv: 2505.10872 by Chenxi Jiang, Chuhao Zhou, Jianfei Yang.

**Figure 1.** Left: Robots using existing LLM-based task planners can understand clear instructions with explicit referring expressions (REs), but they struggle to resolve implicit REs in multi-turn dialogues. Right: We propose the REI-Bench framework that aims to study real-world HRI scenarios where coreferential vagueness exists in human instructions.

**Figure 2.** Data curation pipeline of the REI dataset. From a seed instruction, we (1) generate context memory; (2) produce three context variants (Standard, Noised, Short); (3) replace explicit REs with implicit ones across varying degrees. This results in subsets reflecting nine levels of coreferential vagueness, determined by RE types (Explicit/Mixed/Implicit) and context variants.

**Figure 3.** Addressing implicit referring expressions in task planning. Top row: LLM succeeds with explicit REs (“potato”), but misidentifies the object with implicit REs (“the heated one”). Middle row: a reflection prompt from humans can guide the LLM to resolve the implicit REs and identify the correct object. Bottom row: Comparison among different prompting methods, including aware prompt (AP), chain-of-thought (Co…

**Figure 4.** Success rate (%) of three task planner frameworks, SayCan, DAG-Plan, and HPE, using …

**Figure 6.** Failure example on “Mixed REs & Short Context”, using LLaMA3.1-8B+SayCan. Due to …

**Figure 7.** Success rate (%) of two task planner frameworks (SayCan, DAG-Plan, HPE, and LLM+P …

**Figure 8.** Success rates (%) of various prompting methods applied to LLaMA 3.1-8B, Gemma 2-9B, …

**Figure 9.** Success case on “Explicit REs & Noised Context” (top) and failure case on “Implicit REs …

**Figure 10.** Success case on “Explicit REs & Short Context” (top), failure case on “Mixed REs & …

**Figure 11.** Success case on “Explicit REs & Standard Context” (top) and failure case on “Implicit …

**Figure 12.** Success case on “Explicit REs & Noised Context” (top) and failure case on “Mixed REs …

**Figure 13.** Success case on “Explicit REs & Short Context” (top) and failure case on “Implicit REs & …

**Figure 14.** Success (top) and failure (bottom) cases on the “Implicit REs & Standard Context” task …
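
The Figure 2 caption describes nine subsets formed by crossing three RE types with three context variants. A minimal sketch that enumerates them, using the label format seen in the later captions (for example “Mixed REs & Short Context”), is below. The constant and function names are hypothetical, and the enumeration order is not meant to rank the subsets by difficulty.

```python
from itertools import product

RE_TYPES = ("Explicit", "Mixed", "Implicit")
CONTEXT_VARIANTS = ("Standard", "Noised", "Short")


def vagueness_subsets():
    """List the 3 x 3 combinations of RE type and context variant."""
    return [
        f"{re_type} REs & {context} Context"
        for re_type, context in product(RE_TYPES, CONTEXT_VARIANTS)
    ]
```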

Original abstract

Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve amazing performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, who are the groups that robots should serve more. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark that systematically models vague REs grounded in pragmatic theory (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 36.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompts, chains of thought, and in-context learning. By tackling the overlooked issue of vagueness, this work contributes to the research community by advancing real-world task planning and making robots more accessible to non-expert users, e.g., the elderly and children.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

REI-Bench shows vague referring expressions can drop LLM robot planner success by up to 36.9 percent and offers a mitigation that beats some baselines, but the synthetic vagueness has no clear match to real elderly or child instructions.

The paper introduces REI-Bench as the first benchmark that systematically tests how vague referring expressions affect LLM-based robot task planning. It reports success rate drops of up to 36.9 percent, mostly from planners failing to locate objects, and proposes task-oriented context cognition as a fix that generates clearer instructions and outperforms aware prompts, chain-of-thought, and in-context learning in their comparisons.

Referee Report

1 major / 2 minor

Summary. The paper introduces REI-Bench, the first benchmark for LLM-based robot task planning that systematically incorporates vague referring expressions (REs) modeled from pragmatic theory. It reports that such vagueness causes success-rate drops of up to 36.9%, with most failures arising from missing objects, and proposes a task-oriented context cognition method that generates clarified instructions and outperforms aware prompts, chain-of-thought, and in-context learning baselines.

Significance. If the benchmark's synthetic vagueness faithfully reproduces the distribution of REs produced by non-expert users, the work identifies a practically important failure mode for current planners and supplies a lightweight mitigation that could improve accessibility for elderly and child users. The benchmark itself may become a useful testbed for future embodied agents.

major comments (1)

§3 (REI-Bench construction): The modeling of vague REs is stated to be grounded in pragmatic theory, yet the manuscript supplies neither a user-study corpus of instructions from elderly or children nor any quantitative comparison of ambiguity metrics (candidate referents, context-dependence scores) between the generated set and real data. This assumption is load-bearing for both the reported 36.9% degradation magnitude and the claim that task-oriented context cognition achieves reliable SOTA mitigation in the target deployment scenario.
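
One of the ambiguity metrics named above, the number of candidate referents, can be illustrated with a small sketch: count the scene objects that satisfy every constraint a referring expression carries. This is an assumed, simplified definition for illustration, not a metric taken from the paper; the scene schema and the function name are hypothetical.

```python
def candidate_referents(scene, constraints):
    """Return names of scene objects whose attributes satisfy every constraint.

    `scene` is a list of attribute dicts, each with a "name" key;
    `constraints` maps attribute names to required values.
    """
    return [
        obj["name"]
        for obj in scene
        if all(obj.get(key) == value for key, value in constraints.items())
    ]
```

Under this definition, an expression such as “the heated one” in a scene with two heated items has two candidates, while an explicit name has one. Comparing the distribution of such counts between generated and real instructions is the kind of check the comment asks for.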

minor comments (2)

Table 2 and §4.2: the success-rate tables would benefit from explicit reporting of the number of trials per condition and any statistical significance tests on the observed differences.
§5 (Discussion): the claim that 'most failure cases stem from missing objects' would be strengthened by a breakdown of failure modes with counts rather than a qualitative statement.
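
As one example of the kind of significance test the first minor comment asks for, the sketch below runs a pooled two-proportion z-test on success counts from two conditions, using only the standard library. The choice of test is an assumption made for illustration; the report does not name a specific test, and paired or exact tests may suit the benchmark better if the same tasks are run under both conditions.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two success rates.

    Uses the pooled-variance normal approximation, so it assumes
    independent trials and reasonably large trial counts.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Reporting the trial counts alongside each success rate is what makes a check like this possible for readers.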

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their detailed review and constructive comments on our manuscript. We address the major concern regarding the construction of REI-Bench below.

Referee: §3 (REI-Bench construction): The modeling of vague REs is stated to be grounded in pragmatic theory, yet the manuscript supplies neither a user-study corpus of instructions from elderly or children nor any quantitative comparison of ambiguity metrics (candidate referents, context-dependence scores) between the generated set and real data. This assumption is load-bearing for both the reported 36.9% degradation magnitude and the claim that task-oriented context cognition achieves reliable SOTA mitigation in the target deployment scenario.

Authors: We agree that empirical validation with real user data would strengthen the benchmark. Our approach in §3 models vague REs by applying principles from pragmatic theory, such as the use of underspecified referring expressions that depend on shared context and the number of potential referents. This allows for systematic variation in ambiguity levels. However, we did not perform a dedicated user study with elderly or child participants in this work, relying instead on theoretical grounding to generate the test cases. The 36.9% degradation is measured within this controlled synthetic setting, which we believe captures key aspects of real-world vagueness as described in the linguistics literature. Similarly, the task-oriented context cognition method is evaluated on the benchmark and shows improvements over baselines. We will revise the manuscript to include a more explicit discussion of the modeling assumptions and a limitations section addressing the lack of direct real-data validation.

Revision: partial

Standing simulated objections (not resolved)

The lack of a user-study corpus from elderly or children and quantitative comparison of ambiguity metrics to real data.

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with independent evaluations

The paper introduces REI-Bench by modeling vague referring expressions according to pragmatic theory, then runs direct experiments on LLM-based planners to measure success-rate degradation and compares a proposed task-oriented context cognition method against prompting baselines. All reported outcomes (36.9% drop, failure-mode observations, SOTA mitigation) are obtained from these external comparisons and observations rather than any self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain remains self-contained against the benchmark data and standard baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that pragmatic theory provides a valid model for vagueness in human-robot instructions and that LLM planners are the relevant baseline class; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Human instructions to robots frequently contain vagueness arising from referring expressions whose meanings depend on dialogue context and environment.
Stated directly in the abstract as the motivation and basis for the benchmark.

pith-pipeline@v0.9.0 · 5814 in / 1233 out tokens · 38281 ms · 2026-05-22T15:19:09.126791+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we propose the first robot task planning benchmark that systematically models vague REs grounded in pragmatic theory (REI-Bench)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
cs.AI · 2026-05 · unverdicted · novelty 5.0

TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang

arXiv preprint arXiv:2311.15649. Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang. Lota-bench: Benchmarking language-oriented task planners for embodied agents, 2024. arXiv preprint arXiv:2402.08178. Herbert H Clark. Bridging. Theoretical issues in natural language processing/Association for Computing Machinery, 1975. Fethiye Irmak Do ˘...

[2]

An Embodied Generalist Agent in 3D World

arXiv preprint arXiv:2311.12871. Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In ICML, pp. 9118–9147, 2022a. Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pi...

[3]

Please have Alice state the request mentioned above only in the last sentence and refrain from making any other requests

The dialogue content of {Seed Instruction} should be included in Alice’s final instruction. Please have Alice state the request mentioned above only in the last sentence and refrain from making any other requests

[4]

Before making this request, Alice should mention some other requirements related to the {REs}

[5]

Each character’s lines should contain no fewer than 20 words, and no actions should be included for any character

There should be six rounds of dialogue before this request. Each character’s lines should contain no fewer than 20 words, and no actions should be included for any character

[6]

Please do not output anything other than the dialogue

[7]

Published as a conference paper at ICLR 2026. Context Memory Generation Example. Seed Instruction: Put a cooked tomato into the refrigerator

Please try to retain the words of {REs} themselves in the conversation, rather than replacing them with pronouns like “it.” Below, I will give you an example: {Example} We used this prompt to expand a seed instruction into a full dialogue, which is shown below. Context Memory Generation Example Seed Instructio...

[8]

Please add content only within the dialogue without deleting any existing content or changing the order of the dialogue

[9]

Please do not change the number of turns in the dialogue. Please do not change the structure of the dialogue

[10]

{REs}” in the sentence “{Seed Instruction}

Please ensure the fluency of the dialogue. Please follow the requirements below for the adaptation. Associated Name Background: There is another member (a human) of the family named {Ambiguous Name}. Please have Alice mention him 3 times when discussing anything related to THE REFERENCE, but without changing the existing meaning of the conversation. Some d...

[11]

Please output the whole new dialogue

[12]

You must output the whole new dialogue, including all the sentences from Alice and Robot

[13]

{REs}” in the previous text, except for replacing “{REs}

Please retain every instance of “{REs}” in the previous text, except for replacing “{REs}” in the last sentence spoken by Alice

[14]

explicit REs & standard context,

You need to output the complete multi-turn dialogue, including the multiple turns of language from both Alice and the robot. Here is an example: {Example} Here is the dialogue: {Dialogue} Output: Context Memory Processing Human: Hey there, I’ve been thinking about what to do with the tomatoes we have. I really...
