Effective Explanations Support Planning Under Uncertainty
Pith reviewed 2026-05-12 00:55 UTC · model grok-4.3
The pith
Explanations that translate into efficient, reliable action plans under uncertainty help people navigate better than poor ones or none at all.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that an explanation's quality can be measured by converting it, via large language model, into a policy prior and value map that a planning agent then follows under partial observability; the resulting paths are scored for efficiency and reliability, with penalties for replanning. Higher-scoring explanations receive higher human helpfulness ratings. In navigation tasks, participants given high-scoring explanations outperform both those given no explanations and those given low-scoring ones.
What carries the argument
The pipeline that uses a large language model to translate an utterance into a policy prior and value map, then runs a planner under partial observability to produce scored paths.
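A minimal sketch of the two pieces such a pipeline needs, an utterance-to-guidance step and a path-scoring step, is shown below; every function name, shape, and weight is a hypothetical placeholder, and the partially observable planner that would sit between the two steps is omitted.

```python
import numpy as np

def llm_to_guidance(explanation: str, grid_shape: tuple):
    """Stand-in for the paper's LLM step, which turns an utterance into a
    policy prior and a value map. This stub returns a uniform prior and a
    zero value map of the right shape; a real implementation would prompt
    an LLM and parse its output."""
    n_actions = 4  # up, down, left, right
    policy_prior = np.full(grid_shape + (n_actions,), 1.0 / n_actions)
    value_map = np.zeros(grid_shape)
    return policy_prior, value_map

def score_path(path_length: int, optimal_length: int, reached_goal: bool,
               n_replans: int, replan_penalty: float = 0.1) -> float:
    """Score a rolled-out path by efficiency and reliability, penalizing
    replanning. The functional form and the penalty weight are illustrative,
    not taken from the paper."""
    efficiency = optimal_length / max(path_length, 1)  # 1.0 means optimal
    reliability = 1.0 if reached_goal else 0.0
    return efficiency * reliability - replan_penalty * n_replans

# Toy usage: a 12-step path to the goal (optimum 10) that needed one replan.
prior, values = llm_to_guidance("go past the fountain, then turn left", (5, 5))
print(score_path(path_length=12, optimal_length=10, reached_goal=True, n_replans=1))
```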
If this is right
- Participants rate higher-scored explanations as more helpful than lower-scored ones.
- People complete navigation tasks more successfully when given explanations than when given none.
- High-scoring explanations produce better navigation outcomes than low-scoring explanations.
- Explanations that force frequent replanning receive lower scores and less human approval.
Where Pith is reading between the lines
- The scoring method could be used to automatically generate or refine explanations optimized for a listener's planning needs.
- It offers an objective way to evaluate explanation quality in domains like teaching or AI assistance without collecting new human ratings each time.
- The same grounding approach might apply to other forms of communication where one party must support another's decisions under incomplete information.
Load-bearing premise
The large language model's conversion of an explanation into policy and value guidance accurately captures what a human listener would extract and use for planning.
What would settle it
Running the same navigation tasks with new participants and checking whether the model's path-efficiency scores still predict measured human success rates and helpfulness ratings would test the claim.
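As a rough illustration of that test, with invented placeholder numbers rather than data from the paper, one could check whether per-explanation model scores rank-correlate with observed human success rates:

```python
import numpy as np

# Hypothetical per-explanation values; a replication would supply real ones.
model_scores  = np.array([0.91, 0.42, 0.77, 0.15, 0.63])  # model path-efficiency scores
human_success = np.array([0.88, 0.50, 0.81, 0.20, 0.70])  # fraction of participants reaching the goal

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks
    (assumes no ties, which holds for this toy example)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

print(f"rank correlation: {spearman(model_scores, human_success):.2f}")
```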
Original abstract
Explaining how to get from A to B can be challenging. It requires mentally simulating what the listener will do based on what they are told. To capture this process, we propose a computational model that converts utterances into action plans: a large language model translates an explanation into program-like guidance (a policy prior and value map), and a planning agent executes it under partial observability. We score explanations by the efficiency and reliability of the resulting paths, penalizing replanning. Across four preregistered experiments, we collect a corpus of 1,200 explanations over 24 maps, elicit helpfulness judgments, measure baseline navigation, and test behavior with explanations of differing quality. Higher-scored explanations are judged more helpful and improve navigation: participants with explanations outperform those without, and high-scoring explanations help more than low-scoring ones. Together, these results show procedural explanation as utility-guided communication shaped by how language can be grounded into action under uncertainty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a computational model that scores explanations for navigation tasks by using an LLM to convert utterances into policy priors and value maps, which a planning agent then executes under partial observability; explanations are scored by the resulting paths' efficiency and reliability (penalizing replanning). Four preregistered experiments collect 1,200 explanations over 24 maps, elicit helpfulness judgments, and test navigation performance, finding that higher-scored explanations are judged more helpful and yield better human navigation outcomes than lower-scored explanations or no explanations.
Significance. If the results hold, the work provides a principled, utility-based framework for evaluating procedural explanations by grounding them in planning under uncertainty, with potential applications in AI explanation systems and human-AI communication. Credit is due for the four preregistered experiments, the 1,200-explanation corpus, and direct behavioral measures of navigation performance rather than relying solely on judgments.
major comments (2)
- [§3] §3 (model description): The central claim that higher-scored explanations improve human navigation requires that the LLM-derived policy prior and value map faithfully capture the guidance humans extract and apply under partial observability. No direct validation is reported comparing the agent's planned paths, replanning frequency, or efficiency metrics to human trajectories or mental simulations on the same explanations; correlation with post-hoc helpfulness judgments (Experiment 3) does not test this intermediate representation. A minimal sketch of one such trajectory comparison appears after this list.
- [§4] Methods (§4): Insufficient detail is provided on LLM prompt engineering for converting explanations to policy priors/value maps and on whether planning-agent parameters (e.g., replanning penalty, observability settings) were tuned post-hoc. This affects reproducibility and raises the possibility that the scoring function was optimized on the same data used to claim predictive success for human behavior.
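One hypothetical form of the trajectory-level comparison called for in the first comment above would measure, per explanation, how much of the planning agent's path overlaps with a participant's recorded trajectory on the same map; the data structures and trajectories below are invented for illustration:

```python
def path_overlap(agent_path, human_path):
    """Fraction of the agent's visited cells that also appear in the human
    trajectory; a crude similarity measure, used only as an illustration."""
    agent_cells = set(agent_path)
    human_cells = set(human_path)
    return len(agent_cells & human_cells) / max(len(agent_cells), 1)

# Hypothetical trajectories on a grid, given as (row, col) cells.
agent_path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
human_path = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
print(path_overlap(agent_path, human_path))  # 0.8
```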
minor comments (2)
- Figure captions and legends could more explicitly label the baseline (no-explanation) condition versus explanation conditions to aid quick comparison of navigation performance metrics.
- [Abstract] The abstract states the headline results, the corpus size, and the number of maps, but gives no further quantitative summary of the experimental design (for example, participant counts per experiment); adding one would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, reproducibility, and transparency.
Point-by-point responses
Referee: [§3] §3 (model description): The central claim that higher-scored explanations improve human navigation requires that the LLM-derived policy prior and value map faithfully capture the guidance humans extract and apply under partial observability. No direct validation is reported comparing the agent's planned paths, replanning frequency, or efficiency metrics to human trajectories or mental simulations on the same explanations; correlation with post-hoc helpfulness judgments (Experiment 3) does not test this intermediate representation.
Authors: We agree that direct validation of the intermediate representations would strengthen the link between the computational model and human planning processes. Our primary evidence for the model's utility remains the behavioral outcomes: higher-scored explanations yield higher helpfulness ratings (Experiment 3) and measurably better navigation performance under uncertainty (Experiment 4). These results indicate that the scoring function identifies explanations that support effective human action, even if the precise internal representations are not directly compared. We have added a dedicated limitations paragraph in the revised discussion that acknowledges the absence of trajectory-level comparisons and outlines planned follow-up work to collect human navigation paths for such validation.
Revision: partial
Referee: [§4] Methods (§4): Insufficient detail is provided on LLM prompt engineering for converting explanations to policy priors/value maps and on whether planning-agent parameters (e.g., replanning penalty, observability settings) were tuned post-hoc. This affects reproducibility and raises the possibility that the scoring function was optimized on the same data used to claim predictive success for human behavior.
Authors: We have substantially expanded the Methods section and added a new appendix containing the exact LLM prompts used to derive policy priors and value maps. We also now explicitly state that all planning-agent parameters (including replanning penalty and observability settings) were fixed in advance based on the task environment and pilot data collected prior to the main experiments. No parameters were tuned on the 1,200-explanation corpus or the human evaluation data, preserving the preregistered separation between model specification and evaluation.
Revision: yes
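The separation the authors describe could be made auditable by freezing all planner settings in one configuration object before any corpus or behavioral data are used; the field names and values below are illustrative guesses, not the paper's actual parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlannerConfig:
    """Illustrative planner settings, fixed in advance so the scoring function
    cannot be tuned on the human data it is later compared against."""
    replan_penalty: float = 0.1  # cost added each time the agent must replan
    view_radius: int = 2         # partial-observability window, in grid cells
    max_steps: int = 200         # hard cap on rollout length
    discount: float = 0.95       # value-map discount factor

CONFIG = PlannerConfig()  # frozen: attempting to reassign a field raises an error
```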
Circularity Check
No significant circularity: scoring model is independent of human outcome data
Full rationale
The paper defines a computational scoring procedure (LLM translation of utterance to policy prior + value map, followed by planning-agent execution under partial observability, scored on efficiency/reliability with replanning penalty) and applies it to a separately collected corpus of 1,200 explanations. Human helpfulness judgments and navigation performance are measured in independent preregistered experiments. No equations or text indicate that the scoring function, LLM prompts, or planner parameters were fitted to the human data; the model is proposed a priori and then correlated with external behavioral measures. No self-citation chains, self-definitional loops, or fitted-input-as-prediction patterns appear in the derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM outputs can be treated as faithful policy priors and value maps for downstream planning.
- Domain assumption: the efficiency and reliability of the resulting paths (with a replanning penalty) measure explanation quality.