Explainable Representation of Finite-Memory Policies for POMDPs using Decision Trees
Pith reviewed 2026-05-23 17:05 UTC · model grok-4.3
The pith
Finite-memory POMDP policies can be represented as decision trees for stationary parts plus a Mealy machine for switching to gain interpretability and smaller size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used attractor-based policies, which allow us to construct yet simpler and smaller representations.
What carries the argument
Hybrid representation that uses decision trees for stationary policy fragments and a Mealy machine to manage transitions between those fragments.
If this is right
- The translated policies are smaller in size than the original finite-state controllers.
- They achieve higher explainability through the use of an interpretable formalism.
- The same translation applies to standard finite-state controllers and extends smoothly to other finite-memory policy variants.
- Attractor-based policies receive even simpler and smaller representations under the identified properties.
Where Pith is reading between the lines
- The hybrid form may allow direct synthesis algorithms to target decision-tree components rather than full controllers.
- Case-study illustrations suggest the approach could support manual inspection of policy logic in sequential decision tasks.
- Smaller representations could lower the memory footprint when deploying policies on embedded hardware.
Load-bearing premise
Any finite-state-controller policy for a POMDP can be decomposed without loss of correctness into stationary decision-tree components plus a Mealy-machine switching layer that stays compact and readable.
What would settle it
A concrete finite-state controller policy whose decision-tree-plus-Mealy translation is either larger than the original controller or fails to preserve the exact input-output behavior on some observation sequence.
Figures
read the original abstract
Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used "attractor-based" policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a representation for finite-memory policies in POMDPs that combines decision trees (for stationary action and observation mappings) with Mealy machines (for state switching). It provides an explicit translation from standard finite-state controller (FSC) policies, shows generalization to other finite-memory policies, exploits properties of attractor-based policies for simpler representations, and demonstrates improved explainability via case studies.
Significance. The constructive translation from FSC policies to a DT+Mealy representation, which preserves behavior by design, is a clear strength and could meaningfully advance explainability of POMDP policies in safety-critical domains. The exploitation of attractor-based policy properties for further size reduction and the concrete case studies are additional positives that distinguish the work.
minor comments (3)
- [Translation design section] The translation algorithm would benefit from an explicit pseudocode listing or boxed procedure (e.g., in the section describing the FSC-to-DT+Mealy mapping) to make the construction fully reproducible without ambiguity in edge cases such as observation equivalence.
- [Case studies] Case studies should report both the original FSC state count and the combined DT+Mealy size (nodes + Mealy states) side-by-side, together with a brief note on how 'size' is measured, to strengthen the 'typically smaller' claim.
- [Preliminaries / Notation] Notation for the Mealy machine output function and the decision-tree leaf labels is introduced without a consolidated table of symbols; adding one would aid readability for readers unfamiliar with the combined formalism.
Simulated Author's Rebuttal
We thank the referee for the positive summary and significance assessment of our manuscript. The recommendation for minor revision is noted. As no specific major comments were provided in the report, we have no points to address point-by-point at this time.
Circularity Check
No significant circularity; constructive translation stands independently
full rationale
The paper's core contribution is an explicit, behavior-preserving translation from standard FSC policies to a DT+Mealy representation, defined directly in the construction without any reduction of output size or explainability metrics back to fitted parameters, self-cited uniqueness theorems, or input data. The decomposition into stationary DT components and switching Mealy layer is the method itself, not a derived claim that loops to its own assumptions. Case studies on attractor-based policies illustrate empirical size reduction but do not serve as load-bearing derivations. No self-definitional, fitted-input, or ansatz-smuggling patterns appear; the derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption POMDPs are a fundamental framework for decision-making under uncertainty and partial observability
- domain assumption Finite-memory policies are the practical alternative when infinite-memory policies are undecidable or unimplementable
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Andriushchenko, R.; Bork, A.; Ceska, M.; Junges, S.; Katoen, J.; and Mac \' a k, F. 2023. Search and Explore: Symbiotic Policy Synthesis in POMDPs. In Computer Aided Verification - 35th International Conference, CAV 2023, Paris, France, July 17-22, 2023, Proceedings, Part III , volume 13966 of Lecture Notes in Computer Science, 113--135. Springer
work page 2023
-
[4]
Andriushchenko, R.; Ceska, M.; Junges, S.; Katoen, J.; and Stupinsk \' y , S. 2021. PAYNT: A Tool for Inductive Synthesis of Probabilistic Programs. In Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part I , volume 12759 of Lecture Notes in Computer Science, 856--869. Springer
work page 2021
-
[5]
Ashok, P.; Jackermeier, M.; Jagtap, P.; Kret \' nsk \' y , J.; Weininger, M.; and Zamani, M. 2020. dtControl: decision tree learning algorithms for controller representation. In HSCC , 17:1--17:7. ACM
work page 2020
-
[6]
Ashok, P.; Jackermeier, M.; K r et \' nsk \' y , J.; Weinhuber, C.; Weininger, M.; and Yadav, M. 2021. dtControl 2.0: Explainable Strategy Representation via Decision Tree Learning Steered by Experts. In TACAS (2) , volume 12652 of Lecture Notes in Computer Science, 326--345. Springer
work page 2021
-
[7]
Ashok, P.; K r et\' i nsk\' y , J.; Larsen, K. G.; Co \" e nt, A. L.; Taankvist, J. H.; and Weininger, M. 2019. SOS: Safe, Optimal and Small Strategies for Hybrid M arkov Decision Processes. In QEST , volume 11785 of Lecture Notes in Computer Science, 147--164. Springer
work page 2019
-
[8]
str \"o m, K. J. 1965. Optimal control of Markov processes with incomplete state information I. Journal of mathematical analysis and applications, 10: 174--205
work page 1965
-
[9]
Baier, C.; and Katoen, J. 2008. Principles of model checking. MIT Press. ISBN 978-0-262-02649-9
work page 2008
-
[10]
Bork, A.; Chakraborty, D.; Grover, K.; K r et \' nsk \`y , J.; and Mohr, S. 2024. Learning Explainable and Better Performing Representations of POMDP Strategies. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 299--319. Springer
work page 2024
-
[11]
Bork, A.; Junges, S.; Katoen, J.; and Quatmann, T. 2020. Verification of Indefinite-Horizon POMDPs. In Automated Technology for Verification and Analysis - 18th International Symposium, ATVA 2020, Hanoi, Vietnam, October 19-23, 2020, Proceedings , volume 12302 of Lecture Notes in Computer Science, 288--304. Springer
work page 2020
-
[12]
Boutilier, C.; Dearden, R.; and Goldszmidt, M. 1995. Exploiting Structure in Policy Construction. In IJCAI , 1104--1113. Morgan Kaufmann
work page 1995
-
[13]
Br \' a zdil, T.; Chatterjee, K.; Chmelik, M.; Fellner, A.; and K r et\' i nsk\' y , J. 2015. Counterexample Explanation by Learning Small Strategies in M arkov Decision Processes. In CAV (1) , volume 9206 of Lecture Notes in Computer Science, 158--177. Springer
work page 2015
-
[14]
Br \' a zdil, T.; Chatterjee, K.; K r et\' i nsk\' y , J.; and Toman, V. 2018. Strategy Representation by Decision Trees in Reactive Synthesis. In TACAS (1) , volume 10805 of Lecture Notes in Computer Science, 385--407. Springer
work page 2018
-
[15]
Breiman, L. 2017. Classification and regression trees. Routledge
work page 2017
-
[16]
Carr, S.; Jansen, N.; and Topcu, U. 2022. Task-Aware Verifiable RNN-Based Policies for Partially Observable Markov Decision Processes. J. Artif. Int. Res., 72: 819–847
work page 2022
-
[17]
Chatterjee, K.; Chmelik, M.; and Davies, J. 2016. A Symbolic SAT-Based Algorithm for Almost-Sure Reachability with Small Strategies in POMDPs. In Schuurmans, D.; and Wellman, M. P., eds., Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA , 3225--3232. AAAI Press
work page 2016
-
[18]
Chatterjee, K.; Doyen, L.; and Henzinger, T. A. 2010. Qualitative analysis of partially-observable Markov decision processes. In International Symposium on Mathematical Foundations of Computer Science, 258--269. Springer
work page 2010
-
[19]
Geffner, H.; and Bonet, B. 1998. Solving large POMDPs using real time dynamic programming. In Working Notes Fall AAAI Symposium on POMDPs, volume 218
work page 1998
-
[20]
Hauskrecht, M.; and Fraser, H. 2000. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial intelligence in medicine, 18(3): 221--244
work page 2000
-
[21]
Hensel, C.; Junges, S.; Katoen, J.; Quatmann, T.; and Volk, M. 2022. The probabilistic model checker Storm. Int. J. Softw. Tools Technol. Transf., 24(4): 589--610
work page 2022
-
[22]
Junges, S.; Jansen, N.; and Seshia, S. A. 2021. Enforcing Almost-Sure Reachability in POMDPs. In Silva, A.; and Leino, K. R. M., eds., Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II , volume 12760 of Lecture Notes in Computer Science, 602--625. Springer
work page 2021
-
[23]
Koul, A.; Greydanus, S.; and Fern, A. 2018. Learning Finite State Representations of Recurrent Policy Networks. ArXiv, abs/1811.12530
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[24]
Kwiatkowska, M. Z.; Norman, G.; and Parker, D. 2011. PRISM 4.0: Verification of Probabilistic Real-Time Systems. In Gopalakrishnan, G.; and Qadeer, S., eds., Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings , volume 6806 of Lecture Notes in Computer Science, 585--591. Springer
work page 2011
-
[25]
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings 1995, 362--370. Elsevier
work page 1995
-
[26]
Madani, O.; Hanks, S.; and Condon, A. 2003. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1-2): 5--34
work page 2003
-
[27]
Maimon, O. Z.; and Rokach, L. 2014. Data mining with decision trees: theory and applications, volume 81. World scientific
work page 2014
-
[28]
Neider, D.; and Markgraf, O. 2019. Learning-Based Synthesis of Safety Controllers. In FMCAD , 120--128. IEEE
work page 2019
-
[29]
Pineau, J.; Gordon, G.; Thrun, S.; et al. 2003. Point-based value iteration: An anytime algorithm for POMDPs. In Ijcai, volume 3, 1025--1032
work page 2003
-
[30]
Pyeatt, L. D.; Howe, A. E.; et al. 2001. Decision tree function approximation in reinforcement learning. In Proceedings of the third international symposium on adaptive systems: evolutionary computation and probabilistic graphical models, volume 2, 70--77. Cuba
work page 2001
-
[31]
Silver, D.; and Veness, J. 2010. Monte-Carlo planning in large POMDPs. Advances in neural information processing systems, 23
work page 2010
-
[32]
Smallwood, R. D.; and Sondik, E. J. 1973. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Oper. Res., 21(5): 1071--1088
work page 1973
-
[33]
Von Winterfeldt, D.; and Edwards, W. 1993. Decision analysis and behavioral research
work page 1993
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.