
arxiv: 2506.07298 · v3 · submitted 2025-06-08 · 💻 cs.LG · cs.AI

Pre-trained Large Language Models Learn Hidden Markov Models In-context

Pith reviewed 2026-05-19 10:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords in-context learning · hidden Markov models · large language models · sequential prediction · latent structure · synthetic benchmarks · animal behavior data

The pith

Pre-trained LLMs infer Hidden Markov Model structure directly from examples in a prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can use in-context examples to predict sequences generated by Hidden Markov Models, reaching accuracy close to the theoretical best possible. This holds across many different synthetic HMMs with varying properties, and the approach also matches expert-designed models on real animal decision data. A sympathetic reader would see this as evidence that in-context learning alone can recover latent transition and emission patterns without any model training or fine-tuning. If correct, it means scientists could apply off-the-shelf LLMs as a diagnostic tool for sequential data that contains hidden Markovian structure.

Core claim

On a diverse set of synthetic HMMs, pre-trained LLMs achieve predictive accuracy approaching the theoretical optimum through in-context learning, and the same method yields competitive results on real-world animal decision-making tasks compared with models built by human experts.

What carries the argument

In-context learning, the process by which the LLM extracts transition and emission probabilities from a small set of example sequences placed inside the prompt.
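To make the setup concrete, here is a minimal sketch of how a sequence might be drawn from a discrete HMM and serialized into a prompt. The matrices `A`, `B`, and `pi` are made-up toy values, and the one-letter-per-symbol encoding in `to_prompt` is a hypothetical choice; the source does not state the prompt format the paper uses.

```python
import random

def sample_hmm(A, B, pi, length, seed=0):
    """Sample hidden states and observations from a discrete HMM.

    A[i][j]: transition probability from state i to state j.
    B[i][k]: probability that state i emits symbol k.
    pi[i]:   initial state probability.
    """
    rng = random.Random(seed)
    n_states, n_symbols = len(pi), len(B[0])
    state = rng.choices(range(n_states), weights=pi)[0]
    states, obs = [], []
    for _ in range(length):
        states.append(state)
        obs.append(rng.choices(range(n_symbols), weights=B[state])[0])
        state = rng.choices(range(n_states), weights=A[state])[0]
    return states, obs

def to_prompt(obs):
    # Hypothetical serialization: symbol 0 -> "a", 1 -> "b", and so on.
    return "".join(chr(ord("a") + o) for o in obs)

# Toy parameters (assumptions for illustration only).
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
pi = [0.5, 0.5]

states, obs = sample_hmm(A, B, pi, length=64)
prompt = to_prompt(obs)
```

Only `obs` would be shown to the model; the hidden `states` stay out of the prompt, which is what makes the task a test of latent-structure inference.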

If this is right

  • Researchers can treat LLMs as ready-made sequence predictors for any data suspected to follow Markovian hidden-state dynamics.
  • The observed scaling trends with model size and prompt length supply practical rules for choosing how many examples to include when analyzing new sequential datasets.
  • ICL performance on HMM tasks offers a new benchmark for measuring how well language models capture latent probabilistic structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-based recovery might extend to other latent-variable models such as linear dynamical systems or partially observable Markov decision processes.
  • If the scaling trends hold, increasing context length could substitute for explicit parameter estimation in many scientific time-series problems.
  • This capability suggests LLMs could serve as quick first-pass analyzers before committing resources to specialized fitting algorithms.

Load-bearing premise

The provided in-context examples contain enough information for the LLM to recover the underlying transition and emission structure of the HMM without any further training.

What would settle it

Run the same prompt format on a new family of HMMs whose parameters are known exactly, and check whether the LLM's next-token prediction error remains strictly above the Bayes-optimal error computed from the true model.
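One way such a check could be set up is sketched below: the exact predictive distribution under the true model is computed with the normalized forward recursion, and its argmax accuracy gives the reference line that an LLM's predictions would be compared against. The function names and toy parameters are assumptions for illustration, and the sketch assumes every observed symbol has nonzero probability under the model.

```python
def predictive_distributions(A, B, pi, obs):
    """Return P(O_{t+1} | O_1..O_t) for t = 0..len(obs) under a known HMM."""
    n, m = len(pi), len(B[0])
    belief = list(pi)  # P(X_{t+1} | O_1..O_t), starting at t = 0
    preds = []
    for o in list(obs) + [None]:
        # Predict the next symbol by marginalizing over the hidden state.
        preds.append([sum(belief[j] * B[j][k] for j in range(n)) for k in range(m)])
        if o is None:
            break
        # Condition on the symbol actually observed, then normalize.
        post = [belief[i] * B[i][o] for i in range(n)]
        z = sum(post)
        post = [p / z for p in post]
        # Propagate the belief one step through the transition matrix.
        belief = [sum(post[i] * A[i][j] for i in range(n)) for j in range(n)]
    return preds

def argmax_accuracy(preds, obs):
    """Fraction of steps where the most probable predicted symbol was observed."""
    hits = sum(
        1 for p, o in zip(preds, obs)
        if max(range(len(p)), key=p.__getitem__) == o
    )
    return hits / len(obs)

# Toy check on a fully observed two-state chain (illustrative values).
A = [[0.9, 0.1], [0.2, 0.8]]
B_identity = [[1.0, 0.0], [0.0, 1.0]]
pi = [0.5, 0.5]
preds = predictive_distributions(A, B_identity, pi, [0, 0, 1])
reference_accuracy = argmax_accuracy(preds, [0, 0, 1])
```

An LLM's per-step predictions on the same sequence could then be scored with `argmax_accuracy` and compared against `reference_accuracy`.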

Figures

Figures reproduced from arXiv: 2506.07298 by Jennifer J. Sun, Sarah Dean, Yahya Sattar, Yijia Dai, Zhaolin Gao.

Figure 1: Overview of our study. We start by studying whether ICL using pre-trained LLMs can converge to theoretical optimum on HMM sequences (Q1, Section 2), then study how HMM properties affect the convergence rate/gap with theoretical conjectures (Q2, Section 3), and finally we demonstrate how these findings translate to insights on real-world datasets for studying behaviors in science (Q3, Section 4). …
Figure 2: Properties of HMMs. 2.1 HMM Background. Hidden Markov model: HMMs impose a set of probabilistic assumptions on how sequences of data are generated. The elements of the sequence are called observations, denoted at each step t by O_t. The observations depend on a hidden state denoted by X_t, which evolves according to a Markov chain. An HMM is characterized by the Markov chain's initial state distribution and it…
Figure 3: (Left) We define T as when LLM converges (see Appendix B for computation metric), and ε as the final accuracy gap at sequence length 2048. (Middle) Examples when LLM accuracy converges to Viterbi. Each curve represents a different HMM parameter setting. LLM ICL shows consistent convergence behavior. (Right) Examples of convergence in Hellinger distance (distance between two probability distributions). LLM …
Figure 4: (Left) Convergence gap ε increases with higher mixing rate (slower mixing) and higher entropy. This plot shows results averaged across all HMM configurations we tested. (Right) Slower mixing (λ2 = 0.95, 0.99) shows delayed convergence compared to (Middle) fast mixing (λ2 = 0.5, 0.75) at similar entropy levels.
Figure 5: HMM parameters M = 8, L = 8, H(A) = 1.5, H(B) = 1. (Left) The gap between P(O_{t+1} | O_t) and Viterbi is small when mixing is fast. (Middle) Accuracy comparison with baselines. (Right) Hellinger distance measures distance between two probability distributions. 3.1 In-context Scaling Trends. Neural scaling laws describe empirical power-law relationships that characterize how neural network performance improves wi…
Figure 6: IBL dataset mice decision-making task. (Left) GLM-HMM model developed by neuroscientists. (Middle) A cartoon illustration of the task. A mouse observes a visual stimulus presented on one side of a screen, with one of six possible intensity levels. It then chooses a side, receiving a water reward if the choice matches the stimulus location. (Right) LLM ICL performance curve averaged across all animals, wit…
Figure 7: Rat reward-learning task. (Left) Analog agent learning to HMMs. (Middle) A cartoon illustration of the more challenging task. No stimulus is presented on either side; instead, the reward probabilities for left and right choices evolve independently via random walks. As the optimal choice changes over time, the rat must learn and adapt its decisions based solely on the history of past rewards. (Right) LLM I…
Figure 8: The singular value decomposition of ergodic unichain Markov matrix
Figure 9: Accuracies of six methods across different …
Figure 10: Hellinger distances of six methods across different …
Figure 11: Accuracies of six methods across different mixing rates ( …
Figure 12: Hellinger distances of six methods across different mixing rates ( …
Figure 13: Accuracies of six methods across different steady state distributions, …
Figure 14: Hellinger distances of six methods across different steady state distributions, …
Figure 15: Accuracies of seven models across different …
Figure 16: Hellinger distances of seven models across different …
Figure 17: Accuracies of seven models across different mixing rates ( …
Figure 18: Hellinger distances of seven models across different mixing rates ( …
Figure 19: Accuracies of seven models across different steady state distributions, …
Figure 20: Hellinger distances of seven models across different steady state distributions, …
Figure 21: Accuracy of three tokenization methods across different mixing rates ( …
Figure 22: Hellinger distance of three tokenization methods across different mixing rates ( …
Figure 23: LLM in-context learning prediction accuracy for mice decision-making task with varying …
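The captions above lean on three quantities: Hellinger distance, the mixing rate λ2, and the entropies H(A) and H(B). The sketch below shows one plausible way to compute each. The entropy definition here (unweighted mean of row entropies, in bits) is an assumption, since the captions do not say how H(A) and H(B) are defined, and the λ2 helper only covers the two-state case, where the second eigenvalue of a stochastic matrix equals its trace minus one.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def mean_row_entropy_bits(P):
    """Unweighted mean Shannon entropy (bits) of the rows of a stochastic matrix.

    Assumption: the paper's H(A) and H(B) may be defined differently.
    """
    def entropy(row):
        return -sum(x * math.log2(x) for x in row if x > 0.0)
    return sum(entropy(row) for row in P) / len(P)

def lambda2_two_state(P):
    """Second eigenvalue of a 2x2 row-stochastic matrix (eigenvalues are 1 and trace - 1)."""
    return P[0][0] + P[1][1] - 1.0

# Toy transition matrix (illustrative values only).
A = [[0.9, 0.1], [0.2, 0.8]]
lam2 = lambda2_two_state(A)        # closer to 1 means slower mixing
h_a = mean_row_entropy_bits(A)
d = hellinger([0.9, 0.1], [0.5, 0.5])
```

Hellinger distance lies in [0, 1], which makes it convenient for comparing a model's predicted next-symbol distribution against the true one across HMMs with different alphabet sizes.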
Abstract

Hidden Markov Models (HMMs) are foundational tools for modeling sequential data with latent Markovian structure, yet fitting them to real-world data remains computationally challenging. In this work, we show that pre-trained large language models (LLMs) can effectively model data generated by HMMs via in-context learning (ICL), their ability to infer patterns from examples within a prompt. On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. We uncover novel scaling trends influenced by HMM properties, and offer theoretical conjectures for these empirical observations. We also provide practical guidelines for scientists on using ICL as a diagnostic tool for complex data. On real-world animal decision-making tasks, ICL achieves competitive performance with models designed by human experts. To our knowledge, this is the first demonstration that ICL can learn and predict HMM-generated sequences, an advance that deepens our understanding of in-context learning in LLMs and establishes its potential as a powerful tool for uncovering hidden structure in complex scientific data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pre-trained large language models (LLMs) can effectively model data generated by Hidden Markov Models (HMMs) via in-context learning (ICL). On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. The work uncovers novel scaling trends influenced by HMM properties, offers theoretical conjectures, provides practical guidelines for using ICL as a diagnostic tool, and demonstrates competitive performance on real-world animal decision-making tasks.

Significance. If the empirical findings are robust and the theoretical optimum is confirmed to be the exact marginal predictive distribution from the ground-truth HMM, this would represent a significant advance in understanding in-context learning. It suggests LLMs can perform implicit Bayesian filtering on latent Markov chains, which has implications for both LLM theory and applications in scientific data analysis. The controlled synthetic experiments and real-world validation are positive aspects. The paper also earns credit for exploring scaling behaviors and providing guidelines.

major comments (2)
  1. [Experimental Evaluation] The definition of the 'theoretical optimum' requires clarification. The paper should specify in the methods or experimental section whether this is computed as the exact next-token probability using the forward algorithm with the known transition matrix A and emission matrix B on the prompt sequence. If it relies on an approximate method like fitting an HMM to the examples or using a non-marginalized baseline, the claim that LLMs approach the HMM optimum and thus learn the hidden structure does not fully hold. This is central to validating the main result.
  2. [Theoretical Conjectures] The conjectures for the observed scaling trends (e.g., dependence on number of hidden states or transition entropy) should be stated more precisely, ideally with a supporting derivation or connection to ICL literature, rather than purely empirical observation.
minor comments (2)
  1. [Abstract] Consider adding a brief mention of the models tested (e.g., specific LLM families) and the range of HMM complexities to give readers a quicker sense of the scope.
  2. [Figures] Improve clarity of plots showing scaling trends by including error bars from multiple runs and clearly labeling what the 'optimum' line represents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and constructive feedback. We address each major comment point by point below, with revisions planned where appropriate.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The definition of the 'theoretical optimum' requires clarification. The paper should specify in the methods or experimental section whether this is computed as the exact next-token probability using the forward algorithm with the known transition matrix A and emission matrix B on the prompt sequence. If it relies on an approximate method like fitting an HMM to the examples or using a non-marginalized baseline, the claim that LLMs approach the HMM optimum and thus learn the hidden structure does not fully hold. This is central to validating the main result.

    Authors: We thank the referee for highlighting this point. In the experiments, the theoretical optimum is computed exactly as the next-token probability via the forward algorithm using the known transition matrix A and emission matrix B on the full prompt sequence, corresponding to the exact marginal predictive distribution of the ground-truth HMM. We agree that the methods section did not make this explicit enough. We will revise the manuscript to include a precise description of this computation, along with the relevant equations, to confirm it is the exact optimum rather than an approximation. revision: yes

  2. Referee: [Theoretical Conjectures] The conjectures for the observed scaling trends (e.g., dependence on number of hidden states or transition entropy) should be stated more precisely, ideally with a supporting derivation or connection to ICL literature, rather than purely empirical observation.

    Authors: We appreciate this recommendation. The conjectures in the current manuscript are based on systematic empirical observations across varied HMM parameters. We will revise the relevant section to formulate the conjectures more precisely and to draw explicit connections to existing ICL literature on implicit Bayesian inference and transformer scaling behaviors. A complete theoretical derivation lies outside the scope of this work, but the revision will strengthen the presentation and grounding of these observations. revision: partial
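For reference, the exact computation described in response 1 corresponds to the textbook forward recursion. The following is a sketch in standard notation, not the authors' own equations: A and B are the transition and emission matrices named above, while the initial distribution π and the forward variable α are symbols assumed here.

```latex
\begin{align}
\alpha_1(i) &= \pi_i \, B_{i,O_1}, \\
\alpha_{t+1}(j) &= B_{j,O_{t+1}} \sum_{i} \alpha_t(i) \, A_{ij}, \\
P(O_{t+1} = k \mid O_{1:t}) &= \frac{\sum_{i} \sum_{j} \alpha_t(i) \, A_{ij} \, B_{jk}}{\sum_{i} \alpha_t(i)}.
\end{align}
```

Here α_t(i) is the joint probability of the first t observations and hidden state i at step t, so the last line marginalizes over both the current and the next hidden state.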

Circularity Check

0 steps flagged

No circularity: empirical performance claims use independent ground-truth HMM benchmark

full rationale

The paper reports empirical results on synthetic HMM sequences where LLM in-context predictions are compared to the theoretical optimum defined by the known transition and emission matrices via the forward algorithm. No equations, derivations, or fitted parameters are presented that reduce to self-definition or self-citation by construction. Scaling trends and conjectures are offered as post-hoc interpretations of observed performance, not as load-bearing premises. The central claim is externally falsifiable against the ground-truth HMM predictor and does not rely on any self-referential loop or ansatz smuggled through prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract-only review supplies no explicit free parameters, axioms, or invented entities; the evaluation therefore records an empty ledger, with a note that the full paper may contain additional modeling assumptions.

pith-pipeline@v0.9.0 · 5727 in / 941 out tokens · 21857 ms · 2026-05-19T10:29:58.135370+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
