Pre-trained Large Language Models Learn Hidden Markov Models In-context

arxiv: 2506.07298 · v3 · submitted 2025-06-08 · 💻 cs.LG · cs.AI

Yijia Dai, Zhaolin Gao, Yahya Sattar, Sarah Dean, Jennifer J. Sun

Pith reviewed 2026-05-19 10:29 UTC · model grok-4.3

keywords in-context learning · hidden Markov models · large language models · sequential prediction · latent structure · synthetic benchmarks · animal behavior data

The pith

Pre-trained LLMs infer Hidden Markov Model structure directly from examples in a prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can use in-context examples to predict sequences generated by Hidden Markov Models, reaching accuracy close to the theoretical best possible. This holds across many different synthetic HMMs with varying properties, and the approach also matches expert-designed models on real animal decision data. A sympathetic reader would see this as evidence that in-context learning alone can recover latent transition and emission patterns without any model training or fine-tuning. If correct, it means scientists could apply off-the-shelf LLMs as a diagnostic tool for sequential data that contains hidden Markovian structure.

Core claim

On a diverse set of synthetic HMMs, pre-trained LLMs achieve predictive accuracy approaching the theoretical optimum through in-context learning, and the same method yields competitive results on real-world animal decision-making tasks compared with models built by human experts.

What carries the argument

In-context learning, the process by which the LLM extracts transition and emission probabilities from a small set of example sequences placed inside the prompt.
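
To make the setup concrete, here is a minimal sketch of how example sequences might be generated from an HMM and placed in a prompt. The names `sample_hmm` and `to_prompt`, the toy matrices, and the one-letter-per-symbol serialization are assumptions for illustration only; the paper's actual prompt format and tokenization are not described on this page.

```python
import random


def sample_hmm(pi, A, B, length, rng):
    """Sample hidden states and observations from a discrete HMM.

    pi[i]   = P(X_1 = i)
    A[i][j] = P(X_{t+1} = j | X_t = i)
    B[i][o] = P(O_t = o | X_t = i)
    """
    states, observations = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # Emit an observation from the current hidden state.
        obs = rng.choices(range(len(B[state])), weights=B[state])[0]
        observations.append(obs)
        # Move the hidden state one step along the Markov chain.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, observations


def to_prompt(observations):
    # Hypothetical serialization: symbol 0 -> "a", 1 -> "b", and so on.
    return " ".join(chr(ord("a") + o) for o in observations)


# Toy 2-state, 3-symbol HMM (illustrative values, not taken from the paper).
rng = random.Random(0)
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

states, obs = sample_hmm(pi, A, B, 20, rng)
prompt = to_prompt(obs)
print(prompt)
```

Only the observation string would be shown to the model; the hidden states stay on the experimenter's side and are what the model would have to track implicitly.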

If this is right

Researchers can treat LLMs as ready-made sequence predictors for any data suspected to follow Markovian hidden-state dynamics.
The observed scaling trends with model size and prompt length supply practical rules for choosing how many examples to include when analyzing new sequential datasets.
ICL performance on HMM tasks offers a new benchmark for measuring how well language models capture latent probabilistic structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt-based recovery might extend to other latent-variable models such as linear dynamical systems or partially observable Markov decision processes.
If the scaling trends hold, increasing context length could substitute for explicit parameter estimation in many scientific time-series problems.
This capability suggests LLMs could serve as quick first-pass analyzers before committing resources to specialized fitting algorithms.

Load-bearing premise

The provided in-context examples contain enough information for the LLM to recover the underlying transition and emission structure of the HMM without any further training.

What would settle it

Run the same prompt format on a new family of HMMs whose parameters are known exactly, and check whether the LLM's next-token prediction error stays above the Bayes-optimal error computed from the true model by a gap that does not shrink as the prompt grows.
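
The reference point in that test can be computed exactly when the transition and emission matrices are known. The sketch below uses the standard forward recursion to obtain the next-observation distribution and scores its argmax predictions on one sequence. The function names and toy parameters are hypothetical, and this is not the paper's evaluation code; it also assumes every observed symbol has nonzero probability under the model.

```python
def next_obs_distribution(pi, A, B, observations):
    """Exact P(O_{t+1} | O_1..O_t) for a known HMM via the forward recursion.

    pi[i] = P(X_1 = i), A[i][j] = P(X_{t+1} = j | X_t = i),
    B[i][o] = P(O_t = o | X_t = i).
    """
    n_states = len(pi)
    n_symbols = len(B[0])
    prior = list(pi)  # P(X_t | O_1..O_{t-1}), starting from P(X_1)
    for o in observations:
        # Condition on the observed symbol and renormalize.
        belief = [prior[i] * B[i][o] for i in range(n_states)]
        total = sum(belief)
        belief = [b / total for b in belief]
        # Propagate one step to get the prior over the next hidden state.
        prior = [
            sum(belief[i] * A[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    # Marginalize the hidden state to predict the next observation.
    return [
        sum(prior[j] * B[j][o] for j in range(n_states))
        for o in range(n_symbols)
    ]


def bayes_optimal_accuracy(pi, A, B, observations):
    """Empirical accuracy of argmax predictions from the exact predictive.

    Recomputes the filter for each prefix (quadratic cost), which keeps the
    sketch short; needs at least two observations.
    """
    hits = 0
    for t in range(1, len(observations)):
        dist = next_obs_distribution(pi, A, B, observations[:t])
        prediction = max(range(len(dist)), key=dist.__getitem__)
        hits += prediction == observations[t]
    return hits / (len(observations) - 1)


# Toy 2-state, 3-symbol HMM (illustrative values, not taken from the paper).
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]

print(next_obs_distribution(pi, A, B, [0, 0, 0]))
print(bayes_optimal_accuracy(pi, A, B, [0, 0, 2, 2, 2, 0]))
```

An LLM's predicted tokens for the same positions would be scored against the same targets, and the difference between the two accuracies is the gap the test asks about.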

Figures

Figures reproduced from arXiv: 2506.07298 by Jennifer J. Sun, Sarah Dean, Yahya Sattar, Yijia Dai, Zhaolin Gao.

**Figure 1.** Overview of our study. We start by studying whether ICL using pre-trained LLMs can converge to the theoretical optimum on HMM sequences (Q1, Section 2), then study how HMM properties affect the convergence rate/gap with theoretical conjectures (Q2, Section 3), and finally we demonstrate how these findings translate to insights on real-world datasets for studying behaviors in science (Q3, Section 4). capacity …

**Figure 2.** Properties of HMMs. 2.1 HMM Background. Hidden Markov model: HMMs impose a set of probabilistic assumptions on how sequences of data are generated. The elements of the sequence are called observations, denoted at each step t by O_t. The observations depend on a hidden state denoted by X_t, which evolves according to a Markov chain. An HMM is characterized by the Markov chain’s initial state distribution and it…

**Figure 3.** (Left) We define T as the point at which the LLM converges (see Appendix B for computation metric), and ε as the final accuracy gap at sequence length 2048. (Middle) Examples when LLM accuracy converges to Viterbi. Each curve represents a different HMM parameter setting. LLM ICL shows consistent convergence behavior. (Right) Examples of convergence in Hellinger distance (distance between two probability distributions). LLM …

**Figure 4.** (Left) Convergence gap ε increases with higher mixing rate (slower mixing) and higher entropy. This plot shows results averaged across all HMM configurations we tested. (Right) Slower mixing (λ2 = 0.95, 0.99) shows delayed convergence compared to (Middle) fast mixing (λ2 = 0.5, 0.75) at similar entropy levels.

**Figure 5.** HMM parameters M = 8, L = 8, H(A) = 1.5, H(B) = 1. (Left) The gap between P(O_{t+1} | O_t) and Viterbi is small when mixing is fast. (Middle) Accuracy comparison with baselines. (Right) Hellinger distance measures distance between two probability distributions. 3.1 In-context Scaling Trends. Neural scaling laws describe empirical power-law relationships that characterize how neural network performance improves wi…

**Figure 6.** IBL dataset mice decision-making task. (Left) GLM-HMM model developed by neuroscientists. (Middle) A cartoon illustration of the task. A mouse observes a visual stimulus presented on one side of a screen, with one of six possible intensity levels. It then chooses a side, receiving a water reward if the choice matches the stimulus location. (Right) LLM ICL performance curve averaged across all animals, wit…

**Figure 7.** Rat reward-learning task. (Left) Analog agent learning to HMMs. (Middle) A cartoon illustration of the more challenging task. No stimulus is presented on either side; instead, the reward probabilities for left and right choices evolve independently via random walks. As the optimal choice changes over time, the rat must learn and adapt its decisions based solely on the history of past rewards. (Right) LLM I…

**Figure 8.** The singular value decomposition of ergodic unichain Markov matrix

**Figure 9.** Accuracies of six methods across different …

**Figure 10.** Hellinger distances of six methods across different …

**Figure 11.** Accuracies of six methods across different mixing rates ( …

**Figure 12.** Hellinger distances of six methods across different mixing rates ( …

**Figure 13.** Accuracies of six methods across different steady state distributions, …

**Figure 14.** Hellinger distances of six methods across different steady state distributions, …

**Figure 15.** Accuracies of seven models across different …

**Figure 16.** Hellinger distances of seven models across different …

**Figure 17.** Accuracies of seven models across different mixing rates ( …

**Figure 18.** Hellinger distances of seven models across different mixing rates ( …

**Figure 19.** Accuracies of seven models across different steady state distributions, …

**Figure 20.** Hellinger distances of seven models across different steady state distributions, …

**Figure 21.** Accuracy of three tokenization methods across different mixing rates ( …

**Figure 22.** Hellinger distance of three tokenization methods across different mixing rates ( …

**Figure 23.** LLM in-context learning prediction accuracy for mice decision-making task with varying …
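
Two quantities recur in the captions above: the Hellinger distance between predicted and reference distributions, and the mixing rate λ2. The sketch below shows one common way to compute the first, plus the 1/(1-λ2) scale that appears in the informal Theorem 1 passage quoted further down this page. The function names are hypothetical, and the 1/√2 normalization (which bounds the distance by 1) is an assumption, since the captions do not say which convention the paper uses.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Uses the convention H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)),
    so the result lies in [0, 1].
    """
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def relaxation_scale(lambda2):
    """Return 1 / (1 - lambda2) for a second-largest eigenvalue lambda2 < 1."""
    return 1.0 / (1.0 - lambda2)


# Distance between a hypothetical model prediction and a reference distribution.
print(hellinger([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))

# Scale for the four mixing rates named in the Figure 4 caption.
for lam in (0.5, 0.75, 0.95, 0.99):
    print(lam, relaxation_scale(lam))
```

For λ2 = 0.5, 0.75, 0.95, and 0.99 the scale is roughly 2, 4, 20, and 100 steps, which is why values of λ2 close to 1 correspond to slower mixing and to longer contexts before predictions settle.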

Original abstract

Hidden Markov Models (HMMs) are foundational tools for modeling sequential data with latent Markovian structure, yet fitting them to real-world data remains computationally challenging. In this work, we show that pre-trained large language models (LLMs) can effectively model data generated by HMMs via in-context learning (ICL), their ability to infer patterns from examples within a prompt. On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. We uncover novel scaling trends influenced by HMM properties, and offer theoretical conjectures for these empirical observations. We also provide practical guidelines for scientists on using ICL as a diagnostic tool for complex data. On real-world animal decision-making tasks, ICL achieves competitive performance with models designed by human experts. To our knowledge, this is the first demonstration that ICL can learn and predict HMM-generated sequences, an advance that deepens our understanding of in-context learning in LLMs and establishes its potential as a powerful tool for uncovering hidden structure in complex scientific data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that pre-trained large language models (LLMs) can effectively model data generated by Hidden Markov Models (HMMs) via in-context learning (ICL). On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. The work uncovers novel scaling trends influenced by HMM properties, offers theoretical conjectures, provides practical guidelines for using ICL as a diagnostic tool, and demonstrates competitive performance on real-world animal decision-making tasks.

Significance. If the empirical findings are robust and the theoretical optimum is confirmed to be the exact marginal predictive distribution from the ground-truth HMM, this would represent a significant advance in understanding in-context learning. It suggests LLMs can perform implicit Bayesian filtering on latent Markov chains, which has implications for both LLM theory and applications in scientific data analysis. The controlled synthetic experiments and real-world validation are positive aspects. The paper also earns credit for exploring scaling behaviors and providing guidelines.

major comments (2)

[Experimental Evaluation] The definition of the 'theoretical optimum' requires clarification. The paper should specify in the methods or experimental section whether this is computed as the exact next-token probability using the forward algorithm with the known transition matrix A and emission matrix B on the prompt sequence. If it relies on an approximate method like fitting an HMM to the examples or using a non-marginalized baseline, the claim that LLMs approach the HMM optimum and thus learn the hidden structure does not fully hold. This is central to validating the main result.
[Theoretical Conjectures] The conjectures for the observed scaling trends (e.g., dependence on number of hidden states or transition entropy) should be stated more precisely, ideally with a supporting derivation or connection to ICL literature, rather than purely empirical observation.

minor comments (2)

[Abstract] Consider adding a brief mention of the models tested (e.g., specific LLM families) and the range of HMM complexities to give readers a quicker sense of the scope.
[Figures] Improve clarity of plots showing scaling trends by including error bars from multiple runs and clearly labeling what the 'optimum' line represents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and constructive feedback. We address each major comment point by point below, with revisions planned where appropriate.

Referee: [Experimental Evaluation] The definition of the 'theoretical optimum' requires clarification. The paper should specify in the methods or experimental section whether this is computed as the exact next-token probability using the forward algorithm with the known transition matrix A and emission matrix B on the prompt sequence. If it relies on an approximate method like fitting an HMM to the examples or using a non-marginalized baseline, the claim that LLMs approach the HMM optimum and thus learn the hidden structure does not fully hold. This is central to validating the main result.

Authors: We thank the referee for highlighting this point. In the experiments, the theoretical optimum is computed exactly as the next-token probability via the forward algorithm using the known transition matrix A and emission matrix B on the full prompt sequence, corresponding to the exact marginal predictive distribution of the ground-truth HMM. We agree that the methods section did not make this explicit enough. We will revise the manuscript to include a precise description of this computation, along with the relevant equations, to confirm it is the exact optimum rather than an approximation. revision: yes
Referee: [Theoretical Conjectures] The conjectures for the observed scaling trends (e.g., dependence on number of hidden states or transition entropy) should be stated more precisely, ideally with a supporting derivation or connection to ICL literature, rather than purely empirical observation.

Authors: We appreciate this recommendation. The conjectures in the current manuscript are based on systematic empirical observations across varied HMM parameters. We will revise the relevant section to formulate the conjectures more precisely and to draw explicit connections to existing ICL literature on implicit Bayesian inference and transformer scaling behaviors. A complete theoretical derivation lies outside the scope of this work, but the revision will strengthen the presentation and grounding of these observations. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical performance claims use independent ground-truth HMM benchmark

The paper reports empirical results on synthetic HMM sequences where LLM in-context predictions are compared to the theoretical optimum defined by the known transition and emission matrices via the forward algorithm. No equations, derivations, or fitted parameters are presented that reduce to self-definition or self-citation by construction. Scaling trends and conjectures are offered as post-hoc interpretations of observed performance, not as load-bearing premises. The central claim is externally falsifiable against the ground-truth HMM predictor and does not rely on any self-referential loop or ansatz smuggled through prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; evaluation therefore records an empty ledger with note that full paper may contain additional modeling assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum... scaling trends influenced by HMM properties... spectral learning algorithm

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Theorem 1 (Informal) ... t ≳ 1/(1-λ₂(A)) ... Hellinger distance ≤ ε

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 4 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

A method of moments for mixture models and hidden markov models

Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden markov models. In Conference on learning theory , pages 33–1. JMLR Workshop and Conference Proceedings, 2012

work page 2012
[3]

Mice alternate between discrete strategies during perceptual decision-making

Zoe C Ashwood, Nicholas A Roy, Iris R Stone, International Brain Laboratory, Anne E Urai, Anne K Churchland, Alexandre Pouget, and Jonathan W Pillow. Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25(2):201–212, 2022

work page 2022
[4]

Vector- based navigation using grid-like representations in artificial agents

Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector- based navigation using grid-like representations in artificial agents. Nature, 557(7705):429–433, 2018

work page 2018
[5]

A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains

Leonard E Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The annals of mathematical statistics, 41(1):164–171, 1970

work page 1970
[6]

Birth of a transformer: A memory viewpoint

Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36: 1560–1588, 2023

work page 2023
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[8]

Springer Series in Statistics

Olivier Cappé, Eric Moulines, and Tobias Rydén.Inference in Hidden Markov Models. Springer Series in Statistics. Springer, New York, NY , 1st edition, 2005. ISBN 978-0-387-40264-2. doi: 10.1007/0-387-28982-8

work page doi:10.1007/0-387-28982-8 2005
[9]

George Casella and Edward I. George. Explaining the gibbs sampler, 1992

work page 1992
[10]

Discovering symbolic cognitive models from human and animal behavior

Pablo Samuel Castro, Nenad Tomasev, Ankit Anand, Navodita Sharma, Rishika Mohanta, Aparna Dev, Kuba Perlin, Siddhant Jain, Kyle Levin, Noémi Éltet ˝o, Will Dabney, Alexan- der Novikov, Glenn C Turner, Maria K Eckstein, Nathaniel D Daw, Kevin J Miller, and Kimberly L Stachenfeld. Discovering symbolic cognitive models from human and animal behavior. bioRxiv...

work page doi:10.1101/2025.02.05.636732 2025
[11]

Stephanie C. Y . Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers, 2022. URL https://arxiv.org/abs/2205. 05055

work page 2022
[12]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory (Wiley Series in Telecom- munications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954

work page 2006
[13]

Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, and Nikolaos Tsilivis

Benjamin L. Edelman, Ezra Edelman, Surbhi Goel, Eran Malach, and Nikolaos Tsilivis. The evolution of statistical induction heads: In-context learning markov chains, 2024. URL https: //arxiv.org/abs/2402.11004

work page arXiv 2024
[14]

Ephraim and N

Y . Ephraim and N. Merhav. Hidden markov processes. IEEE Transactions on Information Theory, 48(6):1518–1569, 2002. doi: 10.1109/TIT.2002.1003838

work page doi:10.1109/tit.2002.1003838 2002
[15]

Discrete stochastic processes

Robert G Gallager. Discrete stochastic processes. Journal of the Operational Research Society, 48(1):103–103, 1997

work page 1997
[16]

Stochastic relaxation, gibbs distributions, and the bayesian restoration of images

Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, (6): 721–741, 1984. 11

work page 1984
[17]

Hidden markov models: Pitfalls and opportunities in ecology

Richard Glennie, Timo Adam, Vianey Leos-Barajas, Théo Michelot, Theoni Photopoulou, and Brett T McClintock. Hidden markov models: Pitfalls and opportunities in ecology. Methods in Ecology and Evolution, 14(1):43–56, 2023

work page 2023
[18]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Large language models are zero-shot time series forecasters, 2024

Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters, 2024. URL https://arxiv.org/abs/2310.07820

work page arXiv 2024
[20]

Enough coin flips can make llms act bayesian

Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, and David M Chan. Enough coin flips can make llms act bayesian. arXiv preprint arXiv:2503.04722, 2025

work page arXiv 2025
[21]

A spectral algorithm for learning hidden markov models

Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012

work page 2012
[22]

Do llms dream of elephants (when told not to)? latent concept association and associative memory in transformers,

Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Do llms dream of elephants (when told not to)? latent concept association and associative memory in transformers,

work page
[23]

URL https://arxiv.org/abs/2406.18400

work page arXiv
[24]

Michael I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine, page 112–127. IEEE Press, 1990. ISBN 0818620153

work page 1990
[25]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[26]

Standardized and reproducible measurement of decision-making in mice

The International Brain Laboratory, Valeria Aguillon-Rodriguez, Dora Angelaki, Hannah Bayer, Niccolo Bonacchi, Matteo Carandini, Fanny Cazettes, Gaelle Chapuis, Anne K Churchland, Yang Dan, Eric Dewitt, Mayo Faulkner, Hamish Forrest, Laura Haetzel, Michael Häusser, Sonja B Hofer, Fei Hu, Anup Khanal, Christopher Krasniak, Ines Laranjeira, Zachary F Mainen...

work page doi:10.7554/elife.63711 2021
[27]

Markov chains and mixing times , volume 107

David A Levin and Yuval Peres. Markov chains and mixing times , volume 107. American Mathematical Soc., 2017

work page 2017
[28]

Trans- formers as algorithms: Generalization and stability in in-context learning

Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Trans- formers as algorithms: Generalization and stability in in-context learning. In International conference on machine learning, pages 19565–19594. PMLR, 2023

work page 2023
[29]

Observability and reconstructibility of hidden markov models: Implications for control and network congestion control

Andrew R Liu and Robert R Bitmead. Observability and reconstructibility of hidden markov models: Implications for control and network congestion control. In 49th IEEE Conference on Decision and Control (CDC), pages 918–923. IEEE, 2010

work page 2010
[30]

Toni J. B. Liu, Nicolas Boullé, Raphaël Sarfati, and Christopher J. Earls. Llms learn governing principles of dynamical systems, revealing an in-context neural scaling law, 2024. URL https://arxiv.org/abs/2402.00795

work page arXiv 2024
[31]

Toni J. B. Liu, Nicolas Boullé, Raphaël Sarfati, and Christopher J. Earls. Density estimation with llms: a geometric investigation of in-context learning trajectories, 2025. URL https: //arxiv.org/abs/2410.05218

work page arXiv 2025
[32]

Bridging the usability gap: Theoretical and methodological advances for spectral learning of hidden markov models

Xiaoyuan Ma and Jordan Rodu. Bridging the usability gap: Theoretical and methodological advances for spectral learning of hidden markov models. arXiv preprint arXiv:2302.07437, 2023. 12

work page arXiv 2023
[33]

How hidden are hidden processes? a primer on crypticity and entropy convergence

John R Mahoney, Christopher J Ellison, Ryan G James, and James P Crutchfield. How hidden are hidden processes? a primer on crypticity and entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science, 21(3), 2011

work page 2011
[34]

Attention with markov: A framework for principled analysis of transformers via markov chains, 2024

Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with markov: A framework for principled analysis of transformers via markov chains, 2024. URL https://arxiv.org/abs/2402.04161

work page arXiv 2024
[35]

Uncovering ecological state dynamics with hidden markov models

Brett T McClintock, Roland Langrock, Olivier Gimenez, Emmanuelle Cam, David L Borchers, Richard Glennie, and Toby A Patterson. Uncovering ecological state dynamics with hidden markov models. Ecology letters, 23(12):1878–1903, 2020

work page 1903
[36]

Bernstein inequality and moderate deviations under strong mixing conditions

Florence Merlevède, Magda Peligrad, and Emmanuel Rio. Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume, volume 5, pages 273–293. Institute of Mathematical Statistics, 2009

work page 2009
[37]

Miller, Matthew M

Kevin J. Miller, Matthew M. Botvinick, and Carlos D. Brody. From predictive models to cognitive models: Separable behavioral processes underlying reward learning in the rat.bioRxiv,

work page
[38]

URL https://www.biorxiv.org/content/early/2021/02/ 19/461129

doi: 10.1101/461129. URL https://www.biorxiv.org/content/early/2021/02/ 19/461129

work page doi:10.1101/461129 2021
[39]

Optimal regularization can mitigate double descent

Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020

work page arXiv 2003
[40]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

A tutorial on hidden markov models and selected applications in speech recognition

Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989

work page 1989
[42]

Transformers on markov data: Constant depth suffices, 2024

Nived Rajaraman, Marco Bondaschi, Kannan Ramchandran, Michael Gastpar, and Ashok Vard- han Makkuva. Transformers on markov data: Constant depth suffices, 2024. URL https: //arxiv.org/abs/2407.17686

work page arXiv 2024
[43]

An analysis of tokenization: Trans- formers under markov data

Nived Rajaraman, Jiantao Jiao, and Kannan Ramchandran. An analysis of tokenization: Trans- formers under markov data. Advances in Neural Information Processing Systems, 37:62503– 62556, 2024

work page 2024
[44]

Spectral estimation of hidden Markov models

Jordan Rodu. Spectral estimation of hidden Markov models. University of Pennsylvania, 2014

work page 2014
[45]

Mice in a labyrinth show rapid learning, sudden insight, and efficient exploration

Matthew Rosenberg, Tony Zhang, Pietro Perona, and Markus Meister. Mice in a labyrinth show rapid learning, sudden insight, and efficient exploration. eLife, 10:e66175, jul 2021. ISSN 2050-084X. doi: 10.7554/eLife.66175. URL https://doi.org/10.7554/eLife.66175

work page doi:10.7554/elife.66175 2021
[46]

The mouse action recognition system (mars) software pipeline for automated analysis of social behaviors in mice

Cristina Segalin, Jalani Williams, Tomomi Karigo, May Hui, Moriel Zelikowsky, Jennifer J Sun, Pietro Perona, David J Anderson, and Ann Kennedy. The mouse action recognition system (mars) software pipeline for automated analysis of social behaviors in mice. eLife, 10:e63720, nov 2021. ISSN 2050-084X. doi: 10.7554/eLife.63720. URL https://doi.org/10.7554/ e...

work page doi:10.7554/elife.63720 2021
[47]

Improper learning for non-stochastic control

Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020

work page 2020
[48]

Sun, Ann Kennedy, Eric Zhan, David J

Jennifer J. Sun, Ann Kennedy, Eric Zhan, David J. Anderson, Yisong Yue, and Pietro Perona. Task programming: Learning data efficient behavior representations, 2021. URL https: //arxiv.org/abs/2011.13917

work page arXiv 2021
[49]

Fitzgerald, and Nelson Spruston

Weinan Sun, Johan Winnubst, Maanasa Natrajan, Chongxi Lai, Koichiro Kajikawa, Michalis Michaelos, Rachel Gattoni, James E. Fitzgerald, and Nelson Spruston. Learning produces a hippocampal cognitive map in the form of an orthogonalized state machine. bioRxiv, 2023. doi: 10.1101/2023.08.03.551900. URL https://www.biorxiv.org/content/early/2023/ 08/07/2023.0...

[50]

Atika Syeda, Lin Zhong, Renee Tung, Will Long, Marius Pachitariu, and Carsen Stringer. Facemap: a framework for modeling neural activity based on orofacial tracking. Nature Neuroscience, 27(1):187–195, 2024

[51]

Mingtian Tan, Mike Merrill, Vinayak Gupta, Tim Althoff, and Tom Hartvigsen. Are language models actually useful for time series forecasting? Advances in Neural Information Processing Systems, 37:60162–60191, 2024

[52]

Diego Vidaurre, Laurence T Hunt, Andrew J. Quinn, Benjamin A.E. Hunt, Matthew J. Brookes, Anna C. Nobre, and Mark W. Woolrich. Spontaneous cortical activity transiently organises into frequency specific phase-coupling networks. bioRxiv, 2017. doi: 10.1101/150607. URL https://www.biorxiv.org/content/early/2017/10/20/150607

[53]

A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967. doi: 10.1109/TIT.1967.1054010

[54]

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning, 2024. URL https://arxiv.org/abs/2301.11916

[55]

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. Larger language models do in-context learning differently, 2023. URL https://arxiv.org/abs/2303.03846

[56]

Thomas J. Wills, Colin Lever, Francesca Cacucci, Neil Burgess, and John O’Keefe. Attractor dynamics in the hippocampal representation of the local environment. Science, 308:873–876, 2005. URL https://api.semanticscholar.org/CorpusID:13909368

[58]

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference, 2022. URL https://arxiv.org/abs/2111.02080

[59]

Fanny Yang, Sivaraman Balakrishnan, and Martin J. Wainwright. Statistical and computational guarantees for the Baum-Welch algorithm, 2015. URL https://arxiv.org/abs/1512.08269

[60]

Walter Zucchini and Peter Guttorp. A hidden Markov model for space-time precipitation. Water Resources Research, 27(8):1917–1923, 1991.

Appendices

Table of Contents

• Appendix A: Additional Background on HMMs
• Appendix B: Additional Details of Experimental Setup
• Appendix C: Details of Benchmark Models
• Appendix D: Additional Synthetic Experiment Re...

showed that, with probability at least $1 - \delta$, we have $\|\hat{P}_1^{(\perp)} - P_1\| \lesssim \sqrt{\frac{\log(1/\delta)}{\bar{N}}} + \sqrt{\frac{1}{\bar{N}}}$. In the following, we will upper bound the term $\|\hat{P}_1 - \hat{P}_1^{(\perp)}\|$ by considering entry-wise concentration of each $\ell$-th subtrajectory as follows. We have

$$[\hat{P}_1^{(\ell)}]_i - [\hat{P}_1^{(\perp)}]_i = \frac{\sum_{k=1}^{\bar{N}} \left( \mathbb{1}\{o_{kT-\ell} = i\} - \mathbb{1}\{o_T^{(k)} = i\} \right)}{\bar{N}}. \tag{8}$$

First, we observe that E h 1{okT...

Moreover, $\left| \mathbb{1}\{o_{kT-\ell} = i\} - \mathbb{1}\{o_T^{(k)} = i\} \right| \le 1$ almost surely. However, the summation in (8) has weakly dependent terms. Therefore, we use the Bernstein-type inequality for a class of weakly dependent and bounded random variables proposed in [35]. Before that, we need to upper bound the variance of the summation in (8). Observing that E h [ ˆP(ℓ) 1 ]i − [ ˆP(⊥...
