pith. sign in

arxiv: 1906.12239 · v1 · pith:ZS37PKNQnew · submitted 2019-06-28 · 💻 cs.LG · stat.ML

L*-Based Learning of Markov Decision Processes (Extended Version)

Pith reviewed 2026-05-25 13:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords automata learningMarkov decision processesactive learningL* algorithmmodel inferencesamplingdeterministic MDPs
0
0 comments X

The pith

An L*-based sampling algorithm learns the complete structure of deterministic Markov decision processes from test traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an active learning technique for deterministic Markov decision processes that extends Angluin's L* algorithm. In an ideal setting it assumes perfect information, but then relaxes this to a practical sampling method that gathers data by testing the system and observing traces. The method infers both the states and the transition structure without predefined states. Experiments show it achieves higher accuracy than passive learning approaches when given the same volume of test data. This matters because it automates the creation of accurate models for systems whose internal states are not directly observable.

Core claim

We present a novel sampling-based L* algorithm for learning deterministic Markov decision processes that collects information by sampling system traces via testing. It learns the complete model structure including the states and achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data.

What carries the argument

The sampling-based L* algorithm for MDPs, which builds observation tables from traces to identify states as distinguishable future behaviors and infers transitions.

If this is right

  • Learned MDP models can support automated verification and testing of the original system.
  • The algorithm enables model learning in settings where the number of states is unknown in advance.
  • With limited test data, more accurate models become available for decision processes.
  • Active sampling reduces the data needed compared to passive collection of logs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such methods might be combined with reinforcement learning to learn and optimize policies simultaneously.
  • Extending the approach to nondeterministic or continuous-state MDPs could broaden its applicability.
  • Comparing the learned models against ground truth in simulation environments would validate the accuracy gains.

Load-bearing premise

The system under learning is a deterministic Markov decision process whose behavior is fully captured by finite traces sampled through testing.

What would settle it

Apply the algorithm to a small known deterministic MDP with a fixed number of samples and check whether the output model matches the original states, transitions, and probabilities exactly.

Figures

Figures reproduced from arXiv: 1906.12239 by Bernhard K. Aichernig, Giovanni Bacci, Kim G. Larsen, Maria Eichlseder, Martin Tappler.

Figure 1
Figure 1. Figure 1: MDP model of a faulty coffee machine We learn deterministic labelled MDPs as learned by passive learning techniques like IoAler￾gia [30]. Such MDPs define at most one successor state for each source state and input-output pair [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The first gridworld Models similar to our gridworlds have, e.g., been considered in the context of learn￾ing control strategies [20]. Basically, a robot moves around in a world of tiles of different terrains. It may make errors in movement, e.g. move south west instead of south with an error probability depending on the target terrain. Our aim is to learn an environment model, i.e. a map [PITH_FULL_IMAGE:… view at source ↗
Figure 3
Figure 3. Figure 3: The second gridworld [PITH_FULL_IMAGE:figures/full_fig_p033_3.png] view at source ↗
read the original abstract

Automata learning techniques automatically generate system models from test observations. These techniques usually fall into two categories: passive and active. Passive learning uses a predetermined data set, e.g., system logs. In contrast, active learning actively queries the system under learning, which is considered more efficient. An influential active learning technique is Angluin's L* algorithm for regular languages which inspired several generalisations from DFAs to other automata-based modelling formalisms. In this work, we study L*-based learning of deterministic Markov decision processes, first assuming an ideal setting with perfect information. Then, we relax this assumption and present a novel learning algorithm that collects information by sampling system traces via testing. Experiments with the implementation of our sampling-based algorithm suggest that it achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data. Unlike existing learning algorithms with predefined states, our algorithm learns the complete model structure including the states.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper extends Angluin's L* algorithm to deterministic Markov decision processes. It first gives a learning procedure under an ideal perfect-information setting, then relaxes the assumption to present a sampling-based algorithm that collects finite traces via testing. The algorithm is claimed to learn the complete MDP structure (including states) rather than assuming a predefined state space, and experiments with its implementation are said to indicate higher accuracy than state-of-the-art passive learners when given the same volume of test data.

Significance. If the empirical superiority holds under proper quantitative scrutiny, the work would be a useful contribution to automata learning by generalizing L* to MDPs and by automatically inferring the state space. This addresses a practical gap in model inference for systems whose state structure is not known in advance and could support downstream tasks such as verification and testing.

major comments (1)
  1. [Abstract] Abstract: the central claim that the sampling-based algorithm 'achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative results, error bars, number of runs, or definition of the accuracy metric. Without these details the empirical outcome cannot be assessed from the summary alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the sampling-based algorithm 'achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data' is load-bearing for the paper's contribution, yet the abstract supplies no quantitative results, error bars, number of runs, or definition of the accuracy metric. Without these details the empirical outcome cannot be assessed from the summary alone.

    Authors: The abstract is a high-level summary of the contribution and is not intended to contain the full quantitative details of the evaluation. The accuracy metric is defined in Section 4, and the experimental results—including direct comparisons to passive learners on the same data volume, accuracy values, and supporting figures—are reported in full in Section 5. This follows the standard structure in which the abstract states the outcome at a qualitative level while the body supplies the supporting data and definitions. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central contribution is an algorithmic description of an L*-style active learner for deterministic MDPs together with an empirical claim that the implemented sampler outperforms passive baselines on accuracy for equal test-data volume while also recovering state structure. No equation or theorem in the provided text reduces a reported prediction to a parameter fitted inside the same paper, nor does any load-bearing premise rest on a self-citation chain whose validity is presupposed by the present work. The deterministic-MDP assumption is stated explicitly and then relaxed in the sampling construction, leaving the experimental result independent of internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate free parameters or invented entities; the central claim rests on the domain assumption that the target system is a deterministic MDP whose transitions can be resolved by trace sampling.

axioms (1)
  • domain assumption The system under learning is a deterministic Markov decision process.
    Stated when relaxing the ideal setting to sampling.

pith-pipeline@v0.9.0 · 5701 in / 1210 out tokens · 22247 ms · 2026-05-25T13:28:44.397129+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1]

    In: Bennaceur, A., H¨ ahnle, R., Meinke, K

    Aichernig, B.K., Mostowski, W., Mousavi, M.R., Tappler, M., Taromirad, M.: Model learning and model-based testing. In: Bennaceur, A., H¨ ahnle, R., Meinke, K. (eds.) Machine Learning for Dynamic Software Analysis: P otentials and Limits - International Dagstuhl Seminar 16172, Dagstuhl Castle, G ermany, April 24-27, 2016, Revised Papers. Lecture Notes in C...

  2. [2]

    Journal of Automated Reasoning (Oct 2 018), https://doi.org/10.1007/s10817-018-9486-0

    Aichernig, B.K., Tappler, M.: Efficient active automata le arning via mutation testing. Journal of Automated Reasoning (Oct 2 018), https://doi.org/10.1007/s10817-018-9486-0

  3. [3]

    Formal Methods in System Design (Ma y 2019), https://doi.org/10.1007/s10703-019-00333-0

    Aichernig, B.K., Tappler, M.: Probabilistic black-box r eachability check- ing (extended version). Formal Methods in System Design (Ma y 2019), https://doi.org/10.1007/s10703-019-00333-0

  4. [4]

    Angluin, D.: Learning regular sets from queries and count erexamples. Inf. Comput. 75(2), 87–106 (1987), https://doi.org/10.1016/0890-5401(87)90052-6

  5. [5]

    Aspnes, J., Herlihy, M.: Fast randomized consensus using shared memory. J. Algo- rithms 11(3), 441–461 (1990), https://doi.org/10.1016/0196-6774(90)90021-6

  6. [6]

    http://people.cs.aau.dk/~giovbacci/tools/bisimdist.zip, accessed: June 28, 2019

    Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: MDPDist l ibrary. http://people.cs.aau.dk/~giovbacci/tools/bisimdist.zip, accessed: June 28, 2019

  7. [7]

    In: Chatterjee, K., Sgall, J

    Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: Computin g behavioral distances, compositionally. In: Chatterjee, K., Sgall, J. (eds.) Mathemati- cal Foundations of Computer Science 2013 - 38th Internation al Symposium, MFCS 2013, Klosterneuburg, Austria, August 26-30, 2013. Pr oceedings. Lec- ture Notes in Computer Science, vol. 8087, pp. 74–85. Spring ...

  8. [8]

    In: Joshi, K.R., Siegle, M., Stoelinga, M., D’Argenio, P.R

    Bacci, G., Bacci, G., Larsen, K.G., Mardare, R.: The Bisim Dist library: Efficient computation of bisimilarity distances for Markovian model s. In: Joshi, K.R., Siegle, M., Stoelinga, M., D’Argenio, P.R. (eds.) Quantitative Eva luation of Systems - 10th International Conference, QEST 2013, Buenos Aires, Ar gentina, August 27- 30, 2013. Proceedings. Lecture ...

  9. [9]

    MIT P ress (2008)

    Baier, C., Katoen, J.: Principles of model checking. MIT P ress (2008)

  10. [10]

    Bergadano, F., Varricchio, S.: Learning behaviors of au tomata from multi- plicity and equivalence queries. SIAM J. Comput. 25(6), 1268–1280 (1996), https://doi.org/10.1137/S009753979326091X

  11. [11]

    In: Carrasco, R.C., Oncina, J

    Carrasco, R.C., Oncina, J.: Learning stochastic regula r grammars by means of a state merging method. In: Carrasco, R.C., Oncina, J. (eds.) Grammatical Infer- ence and Applications, Second International Colloquium, I CGI-94, Alicante, Spain, September 21-23, 1994, Proceedings. Lecture Notes in Compu ter Science, vol. 862, pp. 139–152. Springer (1994), htt...

  12. [12]

    ITA 33(1), 1–20 (1999), https://doi.org/10.1051/ita:1999102

    Carrasco, R.C., Oncina, J.: Learning deterministic reg ular grammars from stochastic samples in polynomial time. ITA 33(1), 1–20 (1999), https://doi.org/10.1051/ita:1999102

  13. [13]

    Formal Asp

    Cassel, S., Howar, F., Jonsson, B., Steffen, B.: Active le arning for ex- tended finite state machines. Formal Asp. Comput. 28(2), 233–263 (2016), https://doi.org/10.1007/s00165-016-0355-5

  14. [14]

    Castro, J., Gavald` a, R.: Learning Probability Distrib utions Generated by Finite- State Machines, pp. 113–142. Springer Berlin Heidelberg, B erlin, Heidelberg (2016), https://doi.org/10.1007/978-3-662-48395-4_5

  15. [15]

    In: 11th International Conference on Machine Learning and Appli- cations, ICMLA, Boca Raton, FL, USA, December 12-15, 2012

    Chen, Y., Nielsen, T.D.: Active learning of Markov decis ion processes for system verification. In: 11th International Conference on Machine Learning and Appli- cations, ICMLA, Boca Raton, FL, USA, December 12-15, 2012. V olume 2. pp. 289–294. IEEE (2012), https://doi.org/10.1109/ICMLA.2012.158

  16. [16]

    IEEE Trans

    Chow, T.S.: Testing software design modeled by finite-st ate machines. IEEE Trans. Softw. Eng. 4(3), 178–187 (May 1978) L∗-Based Learning of Markov Decision Processes (Extended Ver sion) 41

  17. [17]

    In: Bultan, T., Hsi- ung, P

    Feng, L., Han, T., Kwiatkowska, M.Z., Parker, D.: Learni ng-based composi- tional verification for synchronous probabilistic systems . In: Bultan, T., Hsi- ung, P. (eds.) Automated Technology for Verification and Ana lysis, 9th Interna- tional Symposium, ATV A 2011, Taipei, Taiwan, October 11-14 , 2011. Proceed- ings. Lecture Notes in Computer Science, vol....

  18. [18]

    : Combining model learn- ing and model checking to analyze TCP implementations

    Fiter˘ au-Bro¸ stean, P., Janssen, R., Vaandrager, F.W. : Combining model learn- ing and model checking to analyze TCP implementations. In: C haudhuri, S., Farzan, A. (eds.) Computer Aided Verification - 28th Interna tional Confer- ence, CA V 2016, Toronto, ON, Canada, July 17-23, 2016, Proce edings, Part II. Lecture Notes in Computer Science, vol. 9780, p...

  19. [19]

    In: Bernardo, M ., Issarny, V

    Forejt, V., Kwiatkowska, M.Z., Norman, G., Parker, D.: A utomated verifica- tion techniques for probabilistic systems. In: Bernardo, M ., Issarny, V. (eds.) Formal Methods for Eternal Networked Software Systems - 11t h International School on Formal Methods for the Design of Computer, Communi cation and Software Systems, SFM 2011, Bertinoro, Italy, June 13...

  20. [20]

    In: Fox, D., Kavraki, L.E

    Fu, J., Topcu, U.: Probably approximately correct MDP le arning and control with temporal logic constraints. In: Fox, D., Kavraki, L.E. , Kurniawati, H. (eds.) Robotics: Science and Systems X, University of California, Berkeley, USA, July 12-16, 2014 (2014), http://www.roboticsproceedings.org/rss10/p39.html

  21. [21]

    Cambridge University Press, New York, NY, USA (2010)

    de la Higuera, C.: Grammatical Inference: Learning Auto mata and Grammars. Cambridge University Press, New York, NY, USA (2010)

  22. [22]

    Journal of the American Statistical Association 58(301), 13–30 (1963), http://www.jstor.org/stable/2282952

    Hoeffding, W.: Probability inequalities for sums of boun ded random vari- ables. Journal of the American Statistical Association 58(301), 13–30 (1963), http://www.jstor.org/stable/2282952

  23. [23]

    In: Bennaceur, A., H¨ ahn le, R., Meinke, K

    Howar, F., Steffen, B.: Active automata learning in pract ice - an annotated bib- liography of the years 2011 to 2016. In: Bennaceur, A., H¨ ahn le, R., Meinke, K. (eds.) Machine Learning for Dynamic Software Analysis: Pot entials and Limits - International Dagstuhl Seminar 16172, Dagstuhl Castle, G ermany, April 24-27, 2016, Revised Papers. Lecture Notes ...

  24. [24]

    In: Hunt Jr., W.A., Somenzi, F

    Hungar, H., Niese, O., Steffen, B.: Domain-specific optim ization in automata learn- ing. In: Hunt Jr., W.A., Somenzi, F. (eds.) Computer Aided Ve rification, 15th International Conference, CA V 2003, Boulder, CO, USA, July 8-12, 2003, Pro- ceedings. Lecture Notes in Computer Science, vol. 2725, pp. 315–327. Springer (2003), https://doi.org/10.1007/978-3-540...

  25. [25]

    In: Bonakdarpour, B., S molka, S.A

    Isberner, M., Howar, F., Steffen, B.: The TTT algorithm: A redundancy-free ap- proach to active automata learning. In: Bonakdarpour, B., S molka, S.A. (eds.) Run- time Verification - 5th International Conference, R V 2014, T oronto, ON, Canada, September 22-25, 2014. Proceedings. Lecture Notes in Compu ter Science, vol. 8734, pp. 307–322. Springer (2014), h...

  26. [26]

    In: Clark, A., Kanazawa, M., Yoshinaka, R

    Khalili, A., Tacchella, A.: Learning nondeterministic Mealy machines. In: Clark, A., Kanazawa, M., Yoshinaka, R. (eds.) Proceedings of the 12th I nternational Confer- ence on Grammatical Inference, ICGI 2014, Kyoto, Japan, Sep tember 17-19, 2014. JMLR Workshop and Conference Proceedings, vol. 34, pp. 109– 123. JMLR.org (2014), http://jmlr.org/proceedings/...

  27. [27]

    SIGMETRICS Performance Evaluation Review 36(3), 17–22 (2008), https://doi.org/10.1145/1481506.1481511

    Kwiatkowska, M.Z., Norman, G., Parker, D.: Analysis of a gossip protocol in PRISM. SIGMETRICS Performance Evaluation Review 36(3), 17–22 (2008), https://doi.org/10.1145/1481506.1481511

  28. [28]

    In: Gopalakrishnan, G., Qadeer, S

    Kwiatkowska, M.Z., Norman, G., Parker, D.: PRISM 4.0: Ve rification of probabilis- tic real-time systems. In: Gopalakrishnan, G., Qadeer, S. ( eds.) Computer Aided Verification - 23rd International Conference, CA V 2011, Sno wbird, UT, USA, July 14-20, 2011. Proceedings. Lecture Notes in Computer Scienc e, vol. 6806, pp. 585–

  29. [29]

    Springer (2011), https://doi.org/10.1007/978-3-642-22110-1

  30. [30]

    In: Fahre nberg, U., Legay, A., Thrane, C.R

    Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G ., Nielsen, B.: Learn- ing Markov decision processes for model checking. In: Fahre nberg, U., Legay, A., Thrane, C.R. (eds.) Proceedings Quantities in Formal Me thods, QFM 2012, Paris, France, 28 August 2012. EPTCS, vol. 103, pp. 49– 63 (2012), https://doi.org/10.4204/EPTCS.103.6

  31. [31]

    Machine Learning 105(2), 255–299 (2016), https://doi.org/10.1007/s10994-016-5565-9

    Mao, H., Chen, Y., Jaeger, M., Nielsen, T.D., Larsen, K.G ., Nielsen, B.: Learning deterministic probabilistic automata from a model checkin g perspective. Machine Learning 105(2), 255–299 (2016), https://doi.org/10.1007/s10994-016-5565-9

  32. [32]

    In: Ninth IEEE Internati onal High-Level Design Validation and Test Workshop 2004

    Margaria, T., Niese, O., Raffelt, H., Steffen, B.: Efficient test-based model gener- ation for legacy reactive systems. In: Ninth IEEE Internati onal High-Level Design Validation and Test Workshop 2004. pp. 95–100. IEEE Compute r Society (2004)

  33. [33]

    Nerode, A.: Linear automaton transformations. Proc. Am . Math. Soc. 9, 541–544 (1958)

  34. [34]

    Journal of Computer Security 14(6), 561–589 (2006)

    Norman, G., Shmatikov, V.: Analysis of probabilistic co ntract signing. Journal of Computer Security 14(6), 561–589 (2006)

  35. [35]

    Rivest, R.L., Schapire, R.E.: Inference of finite automa ta using homing sequences. Inf. Comput. 103(2), 299–347 (1993), https://doi.org/10.1006/inco.1993.1021

  36. [36]

    In: Jung, J., Holz, T

    de Ruiter, J., Poll, E.: Protocol state fuzzing of TLS imp lementations. In: Jung, J., Holz, T. (eds.) 24th USENIX Security Symposium, USENIX Secu rity 15, Wash- ington, D.C., USA, August 12-14, 2015. pp. 193–206. USENIX A ssociation (2015)

  37. [37]

    In: Cav alcanti, A., Dams, D

    Shahbaz, M., Groz, R.: Inferring Mealy machines. In: Cav alcanti, A., Dams, D. (eds.) FM 2009: Formal Methods, Second World Congre ss, Eindhoven, The Netherlands, November 2-6, 2009. Proceedin gs. Lecture Notes in Computer Science, vol. 5850, pp. 207–222. Springer (2009), https://doi.org/10.1007/978-3-642-05089-3_14

  38. [38]

    In: Bernardo, M., Issarny, V

    Steffen, B., Howar, F., Merten, M.: Introduction to activ e automata learning from a practical perspective. In: Bernardo, M., Issarny, V. (eds.) Formal Meth- ods for Eternal Networked Software Systems - 11th Internati onal School on Formal Methods for the Design of Computer, Communication an d Software Systems, SFM 2011, Bertinoro, Italy, June 13-18, 2011....

  39. [39]

    https://doi.org/10.6084/m9.figshare.7960928.v1

    Tappler, M.: Evaluation material for L∗-based learning of Markov decision pro- cesses. https://doi.org/10.6084/m9.figshare.7960928.v1

  40. [40]

    In: 2017 IEEE Intern ational Confer- ence on Software Testing, Verification and Validation, ICST 2017, Tokyo, Japan, March 13-17, 2017

    Tappler, M., Aichernig, B.K., Bloem, R.: Model-based te sting IoT commu- nication via active automata learning. In: 2017 IEEE Intern ational Confer- ence on Software Testing, Verification and Validation, ICST 2017, Tokyo, Japan, March 13-17, 2017. pp. 276–287. IEEE Computer Societ y (2017), https://doi.org/10.1109/ICST.2017.32

  41. [41]

    Soft- ware - Concepts and Tools 17(3), 103–120 (1996)

    Tretmans, J.: Test generation with inputs, outputs and r epetitive quiescence. Soft- ware - Concepts and Tools 17(3), 103–120 (1996)

  42. [42]

    Ma- chine Learning 8, 151–166 (1992), https://doi.org/10.1007/BF00992862 L∗-Based Learning of Markov Decision Processes (Extended Ver sion) 43

    Tzeng, W.: Learning probabilistic automata and Markov c hains via queries. Ma- chine Learning 8, 151–166 (1992), https://doi.org/10.1007/BF00992862 L∗-Based Learning of Markov Decision Processes (Extended Ver sion) 43

  43. [43]

    Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984), https://doi.org/10.1145/1968.1972

  44. [44]

    ECEASST 72 (2015), https://doi.org/10.14279/tuj.eceasst.72.1008

    Volpato, M., Tretmans, J.: Approximate active learning of nonde- terministic input output transition systems. ECEASST 72 (2015), https://doi.org/10.14279/tuj.eceasst.72.1008

  45. [45]

    In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J

    Willemse, T.A.C.: Heuristics for ioco-based test-base d modelling. In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J. (eds.) Formal Me thods: Applications and Technology, 11th International Workshop, FMICS 2006 an d 5th International Workshop PDMC 2006, Bonn, Germany, August 26-27, and August 31, 2006, Re- vised Selected Papers. Lecture Notes in C...