Explanation through Reward Model Reconciliation using POMDP Tree Search

Anna L. Buczak; Anshu Saksena; Benjamin D. Kraske; Zachary N. Sunberg

arXiv: 2305.00931 · v1 · submitted 2023-05-01 · cs.AI · cs.HC · cs.LG

Benjamin D. Kraske, Anshu Saksena, Anna L. Buczak, Zachary N. Sunberg

Pith reviewed 2026-05-24 08:26 UTC · model grok-4.3

classification cs.AI · cs.HC · cs.LG

keywords reward model reconciliation, POMDP planning, action discrepancies, human-AI alignment, explanation generation, reward function inference, partially observable decision processes

The pith

Action discrepancies between a POMDP planner and a human user recover the human's implicit reward weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method to reconcile the reward model inside an online POMDP planner with the implicit reward model a human user holds by treating observed differences in chosen actions as data. The approach searches a tree of possible policies to find reward weightings that best explain why the human and the algorithm diverge. A sympathetic reader would care because the result offers a route to generate explanations that make the algorithm's internal model legible to the user. If the reconciliation succeeds, the planner can surface the objectives the user actually values without requiring the user to state those objectives directly.

Core claim

The central claim is that discrepancies between the actions selected by a POMDP planner and those selected by a human can be used, via tree search, to estimate the weightings the human applies to each term in a reward function, thereby aligning the algorithm's model with the user's objectives and producing explanations grounded in that alignment.

What carries the argument

POMDP tree search over candidate reward weightings that minimizes observed action discrepancies.
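
The mechanism can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes the belief-action value is linear in the weights, Q_φ(b, a) = φ · ψ(b, a), that the feature expectations ψ(b, a) are already available (in practice they would come from the planner's tree search), and it substitutes a coarse grid search over unit-norm weights for whatever optimizer the paper uses. All names and numbers (`scenarios`, `phi_human`, `phi_planner`, `estimate_weights`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def q_value(phi: Vector, psi: Vector) -> float:
    """Linear belief-action value: weights dotted with feature expectations."""
    return phi[0] * psi[0] + phi[1] * psi[1]


def greedy_action(phi: Vector, psi_by_action: Sequence[Vector]) -> int:
    """Index of the action with the highest value under weights phi."""
    values = [q_value(phi, psi) for psi in psi_by_action]
    return max(range(len(values)), key=values.__getitem__)


def count_mismatches(phi: Vector,
                     scenarios: Sequence[Sequence[Vector]],
                     human_actions: Sequence[int]) -> int:
    """Number of decisions where the greedy action under phi differs from the user's."""
    return sum(greedy_action(phi, psi_by_action) != action
               for psi_by_action, action in zip(scenarios, human_actions))


def estimate_weights(scenarios: Sequence[Sequence[Vector]],
                     human_actions: Sequence[int],
                     n_candidates: int = 181) -> Tuple[Vector, int]:
    """Grid search (an assumed stand-in optimizer) over unit-norm, non-negative weights."""
    best_phi: Vector = (1.0, 0.0)
    best_miss = len(human_actions) + 1
    for k in range(n_candidates):
        angle = (math.pi / 2) * k / (n_candidates - 1)
        phi = (math.cos(angle), math.sin(angle))
        miss = count_mismatches(phi, scenarios, human_actions)
        if miss < best_miss:  # keep the first candidate with the fewest mismatches
            best_phi, best_miss = phi, miss
    return best_phi, best_miss


# Hypothetical toy data: one entry per belief, one feature-expectation pair per action.
scenarios: List[List[Vector]] = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (0.0, 0.4)],
    [(0.5, 0.5), (1.0, 0.1)],
]
phi_human: Vector = (0.5, 0.866)   # hidden user weights (assumed)
phi_planner: Vector = (1.0, 0.0)   # planner's original weights (assumed)

human_actions = [greedy_action(phi_human, s) for s in scenarios]
planner_mismatches = count_mismatches(phi_planner, scenarios, human_actions)
phi_hat, remaining = estimate_weights(scenarios, human_actions)
print(planner_mismatches, remaining, phi_hat)
```

On this toy data the planner's weights disagree with the user on two of three decisions, while the search returns a weight vector that reproduces every user action. The returned vector need not equal `phi_human`; any vector in the consistent region explains the observed actions equally well.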

If this is right

The planner can generate explanations that reference the inferred user objectives rather than its own original weights.
Future decisions can be adjusted toward the reconciled model to reduce future discrepancies.
Mission-critical POMDP systems gain a mechanism for surfacing hidden user preferences from behavior alone.
The same discrepancy signal can be reused across multiple planning episodes to refine the estimate over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be tested in domains where ground-truth user weights are known in advance to measure recovery accuracy.
It may extend naturally to settings with continuous action spaces or learned reward models.
Interactive correction loops could be added so that users refine the inferred weights after seeing the explanation.

Load-bearing premise

Observed action discrepancies between the planner and the human are sufficient to recover the human's reward weights without additional data or validation.

What would settle it

An experiment in which two or more distinct reward weight vectors produce identical sequences of human-planner action mismatches under the same observations would show the method cannot uniquely recover the weights.
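
A check of this kind could be sketched as follows. The code is a hypothetical illustration, not an experiment from the paper: it assumes linear belief-action values over precomputed feature expectations, and the weight vectors and `scenarios` are invented toy values. Two weight vectors that yield the same action at every belief also yield the same mismatches against any fixed planner, so comparing action traces is enough.

```python
from typing import List, Sequence


def greedy_action(phi: Sequence[float],
                  psi_by_action: Sequence[Sequence[float]]) -> int:
    """Action index maximizing the linear value phi . psi."""
    values = [sum(w * f for w, f in zip(phi, psi)) for psi in psi_by_action]
    return max(range(len(values)), key=values.__getitem__)


def action_trace(phi: Sequence[float],
                 scenarios: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Greedy action chosen under phi at each belief."""
    return [greedy_action(phi, psi_by_action) for psi_by_action in scenarios]


def indistinguishable(phi_1: Sequence[float],
                      phi_2: Sequence[float],
                      scenarios: Sequence[Sequence[Sequence[float]]]) -> bool:
    """True when both weight vectors choose the same action at every belief."""
    return action_trace(phi_1, scenarios) == action_trace(phi_2, scenarios)


# Hypothetical feature expectations: one entry per belief, one pair per action.
scenarios = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(1.0, 0.0), (0.0, 0.4)],
    [(0.5, 0.5), (1.0, 0.1)],
]
phi_true = (0.5, 0.866)
phi_scaled = (5.0, 8.66)   # positive rescaling of phi_true
phi_other = (0.6, 0.8)     # a different direction in weight space
phi_far = (1.0, 0.0)       # weights that ignore the second feature

print(indistinguishable(phi_true, phi_scaled, scenarios))
print(indistinguishable(phi_true, phi_other, scenarios))
print(indistinguishable(phi_true, phi_far, scenarios))
```

In this toy setting both the rescaled vector and a vector pointing in a different direction are indistinguishable from `phi_true`, which is the situation the test above is meant to detect.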

Figures

Figures reproduced from arXiv: 2305.00931 by Anna L. Buczak, Anshu Saksena, Benjamin D. Kraske, Zachary N. Sunberg.

**Figure 1.** Estimating $\phi_h$ using action discrepancies, where $b_\tau$ is the belief at a given timestep $\tau$, $Q_{\hat{\phi}_h}(b, a)$ is the belief-action value (evaluated on the human reward weighting), $a_{\phi_h,\tau}$ is the user-proposed action at timestep $\tau$, and $a^{*}_{\phi_a,\tau}$ is the optimal action under the algorithm reward weighting at timestep $\tau$.

**Figure 2.** A simulation visualization. Note the penalties incurred at timesteps 7, 8, and 9 (shown in red). Observations are shown in the upper left of each cell, …

Original abstract

As artificial intelligence (AI) algorithms are increasingly used in mission-critical applications, promoting user trust in these systems will be essential to their success. Ensuring users understand the models over which algorithms reason promotes user trust. This work seeks to reconcile differences between the reward model that an algorithm uses for online partially observable Markov decision process (POMDP) planning and the implicit reward model assumed by a human user. Action discrepancies, differences in decisions made by an algorithm and user, are leveraged to estimate a user's objectives as expressed in weightings of a reward function.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper offers a targeted method to infer user reward weights from action discrepancies in POMDP tree search, but non-uniqueness of linear rewards looks like a real limit on what can be recovered.

The core claim is that action differences between a POMDP planner and a human user can be used to estimate the user's reward weights and reconcile the models for better explanations. This is presented as a way to build trust in mission-critical planning applications.

The approach is new in its specific use of online tree search to drive the reconciliation rather than offline methods or direct inverse reinforcement learning. It does a reasonable job framing the practical need and tying the technique to POMDP planning under partial observability. The setup stays within standard linear reward assumptions and leverages existing tree search machinery, which keeps the contribution focused.

The stress-test concern about identifiability holds up on the given description. In linear-reward POMDPs, optimal actions are often invariant to positive scaling of the weight vector and to certain additive shifts, so multiple distinct weight vectors can produce the same policy for a given belief. The abstract and summary give no indication of regularization, normalization constraints, or post-estimation validation to select among equivalent weights. Without that, the recovered weights may not be unique or stable, which directly weakens the explanation goal. No equations, algorithm steps, or experimental results are visible here to show how the method avoids this.

The paper is aimed at researchers working on explainable planning and POMDP applications in uncertain environments. A reader already familiar with POMDP solvers and reward learning might extract a usable technique if the full methods section addresses the uniqueness issue with concrete checks. It deserves peer review because the idea is narrow and checkable; referees can verify whether the tree-search reconciliation actually produces consistent weights or whether the identifiability gap needs fixing.

Referee Report

1 major / 0 minor

Summary. The paper proposes a method for generating explanations in POMDP-based AI systems by reconciling the planner's reward model with a human user's implicit reward model. It does so by leveraging observed action discrepancies between the algorithm and the user, using POMDP tree search to recover estimates of the user's reward function weights.

Significance. If the reconciliation procedure can be shown to produce reliable and unique weight estimates, the approach would address a practical need for interpretable decision-making in partially observable domains. The work builds on standard POMDP planning and tree search but does not appear to include machine-checked proofs, reproducible code releases, or falsifiable predictions that would strengthen its contribution.

major comments (1)

[Abstract] The claim that action discrepancies suffice to estimate the user's reward weights is load-bearing for the central contribution, yet the description supplies no derivation, algorithm, or identifiability argument. In linear-reward POMDPs, optimal policies (and thus action choices) remain invariant under positive scaling and certain additive shifts of the weight vector; without regularization, constraints, or post-hoc validation, multiple distinct weight vectors can explain the same observed discrepancies.
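
The two invariances named in this comment can be written out. The following is a standard derivation rather than anything taken from the paper; it assumes an infinite-horizon discounted POMDP with discount $\gamma \in [0, 1)$, no terminal states, and a reward linear in a feature vector $f(s, a)$. The symbols $f$, $\mu^{\pi}$, $c$, and $k$ are introduced here for illustration only.

```latex
\begin{align}
R_{\phi}(s, a) &= \phi^{\top} f(s, a), \\
Q^{\pi}_{\phi}(b, a)
  &= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi^{\top} f(s_t, a_t)
     \,\middle|\, b_0 = b,\ a_0 = a,\ \pi \right]
   = \phi^{\top} \mu^{\pi}(b, a), \\
Q^{\pi}_{c\phi}(b, a) &= c\, Q^{\pi}_{\phi}(b, a) \qquad (c > 0), \\
\arg\max_{a} Q^{*}_{c\phi}(b, a) &= \arg\max_{a} Q^{*}_{\phi}(b, a), \\
R'(s, a) = R_{\phi}(s, a) + k
  \;&\Longrightarrow\;
  Q'^{\pi}(b, a) = Q^{\pi}_{\phi}(b, a) + \frac{k}{1 - \gamma}.
\end{align}
```

Positive scaling multiplies every policy's value by the same factor, and a constant offset adds the same amount to every policy's value, so neither changes which action is optimal. The offset result depends on the stated assumptions: with terminal states or policy-dependent episode lengths, a constant shift can change the optimal policy.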

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the important identifiability question. We respond to the major comment below.

Referee: [Abstract] The claim that action discrepancies suffice to estimate the user's reward weights is load-bearing for the central contribution, yet the description supplies no derivation, algorithm, or identifiability argument. In linear-reward POMDPs, optimal policies (and thus action choices) remain invariant under positive scaling and certain additive shifts of the weight vector; without regularization, constraints, or post-hoc validation, multiple distinct weight vectors can explain the same observed discrepancies.

Authors: We agree the abstract is concise and omits the supporting derivation. The full manuscript (Section 3) presents the POMDP tree search procedure: candidate weight vectors are evaluated by rolling out the planner under the current belief to predict user actions, then searching for the weight vector that minimizes the observed action discrepancy. On identifiability, the method normalizes weights to unit Euclidean norm after each update, which removes positive scaling invariance. Additive shifts are irrelevant because the reward is a linear combination of features and the value function differences in the Bellman equation cancel any constant offset. A short regularization term toward a default weight vector is also used. We will revise the abstract to include a one-sentence mention of the normalization and regularization steps.

revision: partial
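
The normalization and regularization steps described in this simulated response can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the function names, the squared-distance penalty, the coefficient `lam`, and the default vector `phi_default` are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def normalize(phi: Sequence[float]) -> Tuple[float, ...]:
    """Rescale phi to unit Euclidean norm, removing the positive-scaling ambiguity."""
    norm = math.sqrt(sum(w * w for w in phi))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return tuple(w / norm for w in phi)


def regularized_loss(phi: Sequence[float],
                     phi_default: Sequence[float],
                     n_mismatches: int,
                     lam: float = 0.1) -> float:
    """Mismatch count plus an assumed squared-distance pull toward phi_default."""
    penalty = sum((w - d) ** 2 for w, d in zip(phi, phi_default))
    return n_mismatches + lam * penalty


# Two rescalings of the same direction collapse to one representative.
phi_small = normalize((3.0, 4.0))
phi_large = normalize((30.0, 40.0))

# Two candidates that both explain every user action (zero mismatches):
# the regularizer prefers the one closer to the default weights.
phi_default = (1.0, 0.0)
candidate_a = (0.6, 0.8)
candidate_b = (0.5, 0.866)
loss_a = regularized_loss(candidate_a, phi_default, n_mismatches=0)
loss_b = regularized_loss(candidate_b, phi_default, n_mismatches=0)
print(phi_small, phi_large, loss_a, loss_b)
```

Normalization addresses only the scaling ambiguity. When several unit-norm directions explain the same actions, the choice among them is made by the regularizer, so the recovered weights then reflect the default vector as much as the user's behavior.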

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained against external benchmarks

The abstract and description provide no equations, fitting procedures, or derivation chain that reduces to inputs by construction. No self-definitional steps, fitted inputs called predictions, or load-bearing self-citations are present or quotable. The approach of using action discrepancies to estimate reward weights is presented as a method without visible reduction to tautology or renaming of known results. This is the normal honest finding when the paper's central claim remains independent of its own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard POMDP assumptions plus the untested premise that action discrepancies suffice to recover reward weights; no free parameters, invented entities, or additional axioms are visible in the abstract.

axioms (1)

Domain assumption: standard POMDP formulation with a reward function linear in the weights.
Implicit in the statement that user objectives are expressed as weightings of a reward function.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.

[2] T. Chakraborti, S. Sreedharan, and S. Kambhampati, “The emerging landscape of explainable automated planning & decision making,” in Proc. 29th Int. Joint Conf. Artif. Intell., IJCAI-20, C. Bessiere, Ed., Jul. 2020, pp. 4803–4811, survey track.

[3] T. Chakraborti, S. Sreedharan, Y. Zhang, and S. Kambhampati, “Plan Explanations as Model Reconciliation: Moving Beyond Explanation as Soliloquy,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 156–163.

[4] S. Sreedharan, A. O. Hernandez, A. P. Mishra, and S. Kambhampati, “Model-free model reconciliation,” in Proc. of the 28th Int. Joint Conf. on Artificial Intelligence, IJCAI-19, Jul. 2019, pp. 587–594.

[5] A. Tabrez, S. Agrawal, and B. Hayes, “Explanation-Based Reward Coaching to Improve Human Performance via Reinforcement Learning,” in 2019 14th ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI), Mar. 2019, pp. 249–257, ISSN: 2167-2148.

[6] N. Wang, D. V. Pynadath, and S. G. Hill, “Trust calibration within a human-robot team: Comparing automatically generated explanations,” in 2016 11th ACM/IEEE Int. Conf. on Human-Robot Interaction (HRI), Mar. 2016, pp. 109–116, ISSN: 2167-2148.

[7] A. Yadav, H. Chan, A. Jiang, E. Rice, E. Kamar, B. Grosz, and M. Tambe, “POMDPs for Assisting Homeless Shelters – Computational and Deployment Challenges,” in Autonomous Agents and Multiagent Systems, ser. Lecture Notes in Computer Science, N. Osman and C. Sierra, Eds. Cham: Springer International Publishing, 2016, pp. 67–87.

[8] B. W. Israelsen and N. R. Ahmed, ““Dave...I can assure you ...that it’s going to be all right ...” A Definition, Case for, and Survey of Algorithmic Assurances in Human-Autonomy Trust Relationships,” ACM Computing Surveys, vol. 51, no. 6, pp. 1–37, Nov. 2019.

[9] A. Bobu, A. Peng, P. Agrawal, J. Shah, and A. D. Dragan, “Aligning Robot and Human Representations,” Feb. 2023, arXiv:2302.01928 [cs].

[10] L. Yuan, X. Gao, Z. Zheng, M. Edmonds, Y. N. Wu, F. Rossano, H. Lu, Y. Zhu, and S.-C. Zhu, “In situ bidirectional human-robot value alignment,” Science Robotics, vol. 7, no. 68, p. eabm4183, Jul. 2022, publisher: American Association for the Advancement of Science.

[11] J. Choi and K.-E. Kim, “Inverse Reinforcement Learning in Partially Observable Environments,” Journal of Machine Learning Research, vol. 12, no. 21, pp. 691–730, 2011.

[12] H. R. Chinaei and B. Chaib-Draa, “An Inverse Reinforcement Learning Algorithm for Partially Observable Domains with Application on Healthcare Dialogue Management,” in 2012 11th Int. Conf. Machine Learning and Applications, vol. 1, Dec. 2012, pp. 144–149.

[13] H. Chinaei and B. Chaib-draa, “Dialogue POMDP components (Part II): learning the reward function,” Int. Journal of Speech Technology, vol. 17, no. 4, pp. 325–340, Dec. 2014.

[14] F. Djeumou, M. Cubuktepe, C. Lennon, and U. Topcu, “Task-Guided Inverse Reinforcement Learning under Partial Information,” Proc. Int. Conf. on Automated Planning and Scheduling, vol. 32, pp. 53–61, Jun. 2022.

[15] A. Atrash and J. Pineau, “A bayesian reinforcement learning approach for customizing human-robot interfaces,” in Proc. of the 14th Int. Conf. Intelligent User Interfaces. Sanibel Island Florida USA: ACM, Feb. 2009, pp. 355–360.

[16] R. Y. Rubinstein and D. P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Springer, 2004, vol. 133.

[17] N. Ye, A. Somani, D. Hsu, and W. S. Lee, “DESPOT: Online POMDP Planning with Regularization,” Journal of Artificial Intelligence Research, vol. 58, pp. 231–266, Jan. 2017.
