L*-Based Learning of Markov Decision Processes (Extended Version)
Pith reviewed 2026-05-25 13:28 UTC · model grok-4.3
The pith
An L*-based sampling algorithm learns the complete structure of deterministic Markov decision processes from test traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a novel sampling-based L* algorithm for learning deterministic Markov decision processes that collects information by sampling system traces via testing. It learns the complete model structure including the states and achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data.
What carries the argument
The sampling-based L* algorithm for MDPs, which builds observation tables from traces to identify states as distinguishable future behaviors and infers transitions.
If this is right
- Learned MDP models can support automated verification and testing of the original system.
- The algorithm enables model learning in settings where the number of states is unknown in advance.
- With limited test data, more accurate models become available for decision processes.
- Active sampling reduces the data needed compared to passive collection of logs.
Where Pith is reading between the lines
- Such methods might be combined with reinforcement learning to learn and optimize policies simultaneously.
- Extending the approach to nondeterministic or continuous-state MDPs could broaden its applicability.
- Comparing the learned models against ground truth in simulation environments would validate the accuracy gains.
Load-bearing premise
The system under learning is a deterministic Markov decision process whose behavior is fully captured by finite traces sampled through testing.
What would settle it
Apply the algorithm to a small known deterministic MDP with a fixed number of samples and check whether the output model matches the original states, transitions, and probabilities exactly.
Figures
read the original abstract
Automata learning techniques automatically generate system models from test observations. These techniques usually fall into two categories: passive and active. Passive learning uses a predetermined data set, e.g., system logs. In contrast, active learning actively queries the system under learning, which is considered more efficient. An influential active learning technique is Angluin's L* algorithm for regular languages which inspired several generalisations from DFAs to other automata-based modelling formalisms. In this work, we study L*-based learning of deterministic Markov decision processes, first assuming an ideal setting with perfect information. Then, we relax this assumption and present a novel learning algorithm that collects information by sampling system traces via testing. Experiments with the implementation of our sampling-based algorithm suggest that it achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data. Unlike existing learning algorithms with predefined states, our algorithm learns the complete model structure including the states.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends Angluin's L* algorithm to deterministic Markov decision processes. It first gives a learning procedure under an ideal perfect-information setting, then relaxes the assumption to present a sampling-based algorithm that collects finite traces via testing. The algorithm is claimed to learn the complete MDP structure (including states) rather than assuming a predefined state space, and experiments with its implementation are said to indicate higher accuracy than state-of-the-art passive learners when given the same volume of test data.
Significance. If the empirical superiority holds under proper quantitative scrutiny, the work would be a useful contribution to automata learning by generalizing L* to MDPs and by automatically inferring the state space. This addresses a practical gap in model inference for systems whose state structure is not known in advance and could support downstream tasks such as verification and testing.
major comments (1)
- [Abstract] Abstract: the central claim that the sampling-based algorithm 'achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative results, error bars, number of runs, or definition of the accuracy metric. Without these details the empirical outcome cannot be assessed from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the sampling-based algorithm 'achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative results, error bars, number of runs, or definition of the accuracy metric. Without these details the empirical outcome cannot be assessed from the summary alone.
Authors: The abstract is a high-level summary of the contribution and is not intended to contain the full quantitative details of the evaluation. The accuracy metric is defined in Section 4, and the experimental results—including direct comparisons to passive learners on the same data volume, accuracy values, and supporting figures—are reported in full in Section 5. This follows the standard structure in which the abstract states the outcome at a qualitative level while the body supplies the supporting data and definitions. revision: no
Circularity Check
No significant circularity
full rationale
The paper's central contribution is an algorithmic description of an L*-style active learner for deterministic MDPs together with an empirical claim that the implemented sampler outperforms passive baselines on accuracy for equal test-data volume while also recovering state structure. No equation or theorem in the provided text reduces a reported prediction to a parameter fitted inside the same paper, nor does any load-bearing premise rest on a self-citation chain whose validity is presupposed by the present work. The deterministic-MDP assumption is stated explicitly and then relaxed in the sampling construction, leaving the experimental result independent of internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The system under learning is a deterministic Markov decision process.
Reference graph
Works this paper leans on
-
[1]
In: Bennaceur, A., H¨ ahnle, R., Meinke, K
Aichernig, B.K., Mostowski, W., Mousavi, M.R., Tappler, M., Taromirad, M.: Model learning and model-based testing. In: Bennaceur, A., H¨ ahnle, R., Meinke, K. (eds.) Machine Learning for Dynamic Software Analysis: P otentials and Limits - International Dagstuhl Seminar 16172, Dagstuhl Castle, G ermany, April 24-27, 2016, Revised Papers. Lecture Notes in C...
-
[2]
Journal of Automated Reasoning (Oct 2 018), https://doi.org/10.1007/s10817-018-9486-0
Aichernig, B.K., Tappler, M.: Efficient active automata le arning via mutation testing. Journal of Automated Reasoning (Oct 2 018), https://doi.org/10.1007/s10817-018-9486-0
-
[3]
Formal Methods in System Design (Ma y 2019), https://doi.org/10.1007/s10703-019-00333-0
Aichernig, B.K., Tappler, M.: Probabilistic black-box r eachability check- ing (extended version). Formal Methods in System Design (Ma y 2019), https://doi.org/10.1007/s10703-019-00333-0
-
[4]
Angluin, D.: Learning regular sets from queries and count erexamples. Inf. Comput. 75(2), 87–106 (1987), https://doi.org/10.1016/0890-5401(87)90052-6
-
[5]
Aspnes, J., Herlihy, M.: Fast randomized consensus using shared memory. J. Algo- rithms 11(3), 441–461 (1990), https://doi.org/10.1016/0196-6774(90)90021-6
-
[6]
http://people.cs.aau.dk/~giovbacci/tools/bisimdist.zip, accessed: June 28, 2019
Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: MDPDist l ibrary. http://people.cs.aau.dk/~giovbacci/tools/bisimdist.zip, accessed: June 28, 2019
work page 2019
-
[7]
Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: Computin g behavioral distances, compositionally. In: Chatterjee, K., Sgall, J. (eds.) Mathemati- cal Foundations of Computer Science 2013 - 38th Internation al Symposium, MFCS 2013, Klosterneuburg, Austria, August 26-30, 2013. Pr oceedings. Lec- ture Notes in Computer Science, vol. 8087, pp. 74–85. Spring ...
-
[8]
In: Joshi, K.R., Siegle, M., Stoelinga, M., D’Argenio, P.R
Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: The Bisim Dist library: Efficient computation of bisimilarity distances for Markovian model s. In: Joshi, K.R., Siegle, M., Stoelinga, M., D’Argenio, P.R. (eds.) Quantitative Eva luation of Systems - 10th International Conference, QEST 2013, Buenos Aires, Ar gentina, August 27- 30, 2013. Proceedings. Lecture ...
-
[9]
Baier, C., Katoen, J.: Principles of model checking. MIT P ress (2008)
work page 2008
-
[10]
Bergadano, F., Varricchio, S.: Learning behaviors of au tomata from multi- plicity and equivalence queries. SIAM J. Comput. 25(6), 1268–1280 (1996), https://doi.org/10.1137/S009753979326091X
-
[11]
Carrasco, R.C., Oncina, J.: Learning stochastic regula r grammars by means of a state merging method. In: Carrasco, R.C., Oncina, J. (eds.) Grammatical Infer- ence and Applications, Second International Colloquium, I CGI-94, Alicante, Spain, September 21-23, 1994, Proceedings. Lecture Notes in Compu ter Science, vol. 862, pp. 139–152. Springer (1994), htt...
-
[12]
ITA 33(1), 1–20 (1999), https://doi.org/10.1051/ita:1999102
Carrasco, R.C., Oncina, J.: Learning deterministic reg ular grammars from stochastic samples in polynomial time. ITA 33(1), 1–20 (1999), https://doi.org/10.1051/ita:1999102
-
[13]
Cassel, S., Howar, F., Jonsson, B., Steffen, B.: Active le arning for ex- tended finite state machines. Formal Asp. Comput. 28(2), 233–263 (2016), https://doi.org/10.1007/s00165-016-0355-5
-
[14]
Castro, J., Gavald` a, R.: Learning Probability Distrib utions Generated by Finite- State Machines, pp. 113–142. Springer Berlin Heidelberg, B erlin, Heidelberg (2016), https://doi.org/10.1007/978-3-662-48395-4_5
-
[15]
Chen, Y., Nielsen, T.D.: Active learning of Markov decis ion processes for system verification. In: 11th International Conference on Machine Learning and Appli- cations, ICMLA, Boca Raton, FL, USA, December 12-15, 2012. V olume 2. pp. 289–294. IEEE (2012), https://doi.org/10.1109/ICMLA.2012.158
-
[16]
Chow, T.S.: Testing software design modeled by finite-st ate machines. IEEE Trans. Softw. Eng. 4(3), 178–187 (May 1978) L∗-Based Learning of Markov Decision Processes (Extended Ver sion) 41
work page 1978
-
[17]
Feng, L., Han, T., Kwiatkowska, M.Z., Parker, D.: Learni ng-based composi- tional verification for synchronous probabilistic systems . In: Bultan, T., Hsi- ung, P. (eds.) Automated Technology for Verification and Ana lysis, 9th Interna- tional Symposium, ATV A 2011, Taipei, Taiwan, October 11-14 , 2011. Proceed- ings. Lecture Notes in Computer Science, vol....
-
[18]
: Combining model learn- ing and model checking to analyze TCP implementations
Fiter˘ au-Bro¸ stean, P., Janssen, R., Vaandrager, F.W. : Combining model learn- ing and model checking to analyze TCP implementations. In: C haudhuri, S., Farzan, A. (eds.) Computer Aided Verification - 28th Interna tional Confer- ence, CA V 2016, Toronto, ON, Canada, July 17-23, 2016, Proce edings, Part II. Lecture Notes in Computer Science, vol. 9780, p...
-
[19]
Forejt, V., Kwiatkowska, M.Z., Norman, G., Parker, D.: A utomated verifica- tion techniques for probabilistic systems. In: Bernardo, M ., Issarny, V. (eds.) Formal Methods for Eternal Networked Software Systems - 11t h International School on Formal Methods for the Design of Computer, Communi cation and Software Systems, SFM 2011, Bertinoro, Italy, June 13...
-
[20]
Fu, J., Topcu, U.: Probably approximately correct MDP le arning and control with temporal logic constraints. In: Fox, D., Kavraki, L.E. , Kurniawati, H. (eds.) Robotics: Science and Systems X, University of California, Berkeley, USA, July 12-16, 2014 (2014), http://www.roboticsproceedings.org/rss10/p39.html
work page 2014
-
[21]
Cambridge University Press, New York, NY, USA (2010)
de la Higuera, C.: Grammatical Inference: Learning Auto mata and Grammars. Cambridge University Press, New York, NY, USA (2010)
work page 2010
-
[22]
Hoeffding, W.: Probability inequalities for sums of boun ded random vari- ables. Journal of the American Statistical Association 58(301), 13–30 (1963), http://www.jstor.org/stable/2282952
-
[23]
In: Bennaceur, A., H¨ ahn le, R., Meinke, K
Howar, F., Steffen, B.: Active automata learning in pract ice - an annotated bib- liography of the years 2011 to 2016. In: Bennaceur, A., H¨ ahn le, R., Meinke, K. (eds.) Machine Learning for Dynamic Software Analysis: Pot entials and Limits - International Dagstuhl Seminar 16172, Dagstuhl Castle, G ermany, April 24-27, 2016, Revised Papers. Lecture Notes ...
-
[24]
In: Hunt Jr., W.A., Somenzi, F
Hungar, H., Niese, O., Steffen, B.: Domain-specific optim ization in automata learn- ing. In: Hunt Jr., W.A., Somenzi, F. (eds.) Computer Aided Ve rification, 15th International Conference, CA V 2003, Boulder, CO, USA, July 8-12, 2003, Pro- ceedings. Lecture Notes in Computer Science, vol. 2725, pp. 315–327. Springer (2003), https://doi.org/10.1007/978-3-540...
-
[25]
In: Bonakdarpour, B., S molka, S.A
Isberner, M., Howar, F., Steffen, B.: The TTT algorithm: A redundancy-free ap- proach to active automata learning. In: Bonakdarpour, B., S molka, S.A. (eds.) Run- time Verification - 5th International Conference, R V 2014, T oronto, ON, Canada, September 22-25, 2014. Proceedings. Lecture Notes in Compu ter Science, vol. 8734, pp. 307–322. Springer (2014), h...
-
[26]
In: Clark, A., Kanazawa, M., Yoshinaka, R
Khalili, A., Tacchella, A.: Learning nondeterministic Mealy machines. In: Clark, A., Kanazawa, M., Yoshinaka, R. (eds.) Proceedings of the 12th I nternational Confer- ence on Grammatical Inference, ICGI 2014, Kyoto, Japan, Sep tember 17-19, 2014. JMLR Workshop and Conference Proceedings, vol. 34, pp. 109– 123. JMLR.org (2014), http://jmlr.org/proceedings/...
work page 2014
-
[27]
Kwiatkowska, M.Z., Norman, G., Parker, D.: Analysis of a gossip protocol in PRISM. SIGMETRICS Performance Evaluation Review 36(3), 17–22 (2008), https://doi.org/10.1145/1481506.1481511
-
[28]
In: Gopalakrishnan, G., Qadeer, S
Kwiatkowska, M.Z., Norman, G., Parker, D.: PRISM 4.0: Ve rification of probabilis- tic real-time systems. In: Gopalakrishnan, G., Qadeer, S. ( eds.) Computer Aided Verification - 23rd International Conference, CA V 2011, Sno wbird, UT, USA, July 14-20, 2011. Proceedings. Lecture Notes in Computer Scienc e, vol. 6806, pp. 585–
work page 2011
-
[29]
Springer (2011), https://doi.org/10.1007/978-3-642-22110-1
-
[30]
In: Fahre nberg, U., Legay, A., Thrane, C.R
Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G ., Nielsen, B.: Learn- ing Markov decision processes for model checking. In: Fahre nberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings Quantities in Formal Me thods, QFM 2012, Paris, France, 28 August 2012. EPTCS, vol. 103, pp. 49– 63 (2012), https://doi.org/10.4204/EPTCS.103.6
-
[31]
Machine Learning 105(2), 255–299 (2016), https://doi.org/10.1007/s10994-016-5565-9
Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G ., Nielsen, B.: Learning deterministic probabilistic automata from a model checkin g perspective. Machine Learning 105(2), 255–299 (2016), https://doi.org/10.1007/s10994-016-5565-9
-
[32]
In: Ninth IEEE Internati onal High-Level Design Validation and Test Workshop 2004
Margaria, T., Niese, O., Raffelt, H., Steffen, B.: Efficient test-based model gener- ation for legacy reactive systems. In: Ninth IEEE Internati onal High-Level Design Validation and Test Workshop 2004. pp. 95–100. IEEE Compute r Society (2004)
work page 2004
-
[33]
Nerode, A.: Linear automaton transformations. Proc. Am . Math. Soc. 9, 541–544 (1958)
work page 1958
-
[34]
Journal of Computer Security 14(6), 561–589 (2006)
Norman, G., Shmatikov, V.: Analysis of probabilistic co ntract signing. Journal of Computer Security 14(6), 561–589 (2006)
work page 2006
-
[35]
Rivest, R.L., Schapire, R.E.: Inference of finite automa ta using homing sequences. Inf. Comput. 103(2), 299–347 (1993), https://doi.org/10.1006/inco.1993.1021
-
[36]
de Ruiter, J., Poll, E.: Protocol state fuzzing of TLS imp lementations. In: Jung, J., Holz, T. (eds.) 24th USENIX Security Symposium, USENIX Secu rity 15, Wash- ington, D.C., USA, August 12-14, 2015. pp. 193–206. USENIX A ssociation (2015)
work page 2015
-
[37]
Shahbaz, M., Groz, R.: Inferring Mealy machines. In: Cav alcanti, A., Dams, D. (eds.) FM 2009: Formal Methods, Second World Congre ss, Eindhoven, The Netherlands, November 2-6, 2009. Proceedin gs. Lecture Notes in Computer Science, vol. 5850, pp. 207–222. Springer (2009), https://doi.org/10.1007/978-3-642-05089-3_14
-
[38]
Steffen, B., Howar, F., Merten, M.: Introduction to activ e automata learning from a practical perspective. In: Bernardo, M., Issarny, V. (eds.) Formal Meth- ods for Eternal Networked Software Systems - 11th Internati onal School on Formal Methods for the Design of Computer, Communication an d Software Systems, SFM 2011, Bertinoro, Italy, June 13-18, 2011....
-
[39]
https://doi.org/10.6084/m9.figshare.7960928.v1
Tappler, M.: Evaluation material for L∗-based learning of Markov decision pro- cesses. https://doi.org/10.6084/m9.figshare.7960928.v1
-
[40]
Tappler, M., Aichernig, B.K., Bloem, R.: Model-based te sting IoT commu- nication via active automata learning. In: 2017 IEEE Intern ational Confer- ence on Software Testing, Verification and Validation, ICST 2017, Tokyo, Japan, March 13-17, 2017. pp. 276–287. IEEE Computer Societ y (2017), https://doi.org/10.1109/ICST.2017.32
-
[41]
Soft- ware - Concepts and Tools 17(3), 103–120 (1996)
Tretmans, J.: Test generation with inputs, outputs and r epetitive quiescence. Soft- ware - Concepts and Tools 17(3), 103–120 (1996)
work page 1996
-
[42]
Tzeng, W.: Learning probabilistic automata and Markov c hains via queries. Ma- chine Learning 8, 151–166 (1992), https://doi.org/10.1007/BF00992862 L∗-Based Learning of Markov Decision Processes (Extended Ver sion) 43
-
[43]
Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984), https://doi.org/10.1145/1968.1972
-
[44]
ECEASST 72 (2015), https://doi.org/10.14279/tuj.eceasst.72.1008
Volpato, M., Tretmans, J.: Approximate active learning of nonde- terministic input output transition systems. ECEASST 72 (2015), https://doi.org/10.14279/tuj.eceasst.72.1008
-
[45]
In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J
Willemse, T.A.C.: Heuristics for ioco-based test-base d modelling. In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J. (eds.) Formal Me thods: Applications and Technology, 11th International Workshop, FMICS 2006 an d 5th International Workshop PDMC 2006, Bonn, Germany, August 26-27, and August 31, 2006, Re- vised Selected Papers. Lecture Notes in C...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.