arXiv: 2605.15608 · v1 · pith:RMS3VONJnew · submitted 2026-05-15 · 💻 cs.LG · cs.SY · eess.SY

Transformer-like Inference from Optimal Control

Pith reviewed 2026-05-20 21:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY
keywords: decoder-only transformers · optimal control · dual filter · next-token prediction · inference algorithm · attention weights · non-Markovian structure · sequence prediction

The pith

Reformulating next-token prediction as an optimal control problem produces a dual filter whose layers mirror those of decoder-only transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the task of computing the conditional probability of the next token can be derived from first principles by casting it as an optimal control problem. For both a nonlinear model of discrete-valued processes and a linear Gaussian model, the solution to this control problem is an inference procedure called the dual filter. The dual filter has a layered structure that directly corresponds to the architecture of a decoder-only transformer. A sympathetic reader cares because this offers a principled explanation for why transformers have the form they do, emerging from the mathematics of optimal control rather than from attention mechanisms alone.

Core claim

The paper derives inference architectures from first principles for the next-token prediction problem solved by decoder-only transformers. The prediction objective is reformulated as an optimal control problem in two model classes (a nonlinear discrete-valued process and a linear Gaussian process), and the solution yields an explicit inference algorithm, the dual filter, whose layer structure mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control solution to attention weights from a trained transformer and indicate that, when the embedding dimension is insufficient, the transformer exploits non-Markovian structure.

What carries the argument

The dual filter, the explicit inference algorithm obtained by solving the optimal control reformulation of the next-token prediction objective.

Load-bearing premise

The chosen nonlinear discrete-valued process and linear Gaussian process must be representative of the data-generating mechanisms that transformers are trained on.

What would settle it

Generate sequences from the nonlinear discrete-valued process, train a decoder-only transformer on those sequences, and check whether the transformer's attention weights and layer computations align with the dual filter solution; a systematic mismatch would falsify the relevance of the optimal control derivation.
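
The data-generation half of this test can be sketched in a few lines for the simplest case, τ = 1, where (X, Z) is a hidden Markov model. The sketch below samples an observation sequence and computes the optimal next-observation probabilities with the standard HMM forward filter, which is the baseline a trained transformer's predictions would be compared against. It is not the paper's dual filter, and the matrices `A`, `C`, `mu0` and all function names are hypothetical values chosen for illustration.

```python
import math
import random


def sample_hmm(A, C, mu0, T, rng):
    """Sample hidden states and observations of length T from an HMM.

    A[i][j] = P(next state j | state i), C[i][k] = P(observation k | state i),
    mu0[i] = P(initial state i). All three are illustrative assumptions.
    """
    states, obs = [], []
    x = rng.choices(range(len(mu0)), weights=mu0)[0]
    for _ in range(T):
        states.append(x)
        obs.append(rng.choices(range(len(C[x])), weights=C[x])[0])
        x = rng.choices(range(len(A[x])), weights=A[x])[0]
    return states, obs


def forward_filter_predictions(A, C, mu0, obs):
    """For each t, return the distribution of z_t given z_0, ..., z_{t-1}."""
    d, m = len(mu0), len(C[0])
    prior = list(mu0)  # distribution of x_t given past observations
    preds = []
    for z in obs:
        # Predict the next observation before seeing it.
        preds.append([sum(prior[i] * C[i][k] for i in range(d)) for k in range(m)])
        # Bayes update with the observed symbol z.
        post = [prior[i] * C[i][z] for i in range(d)]
        total = sum(post)
        post = [p / total for p in post]
        # Propagate one step through the transition matrix.
        prior = [sum(post[i] * A[i][j] for i in range(d)) for j in range(d)]
    return preds


def mean_cross_entropy(preds, obs):
    """Average negative log-probability assigned to the observed symbols."""
    return -sum(math.log(p[z]) for p, z in zip(preds, obs)) / len(obs)


# Hypothetical two-state, two-symbol model (all entries strictly positive).
A = [[0.9, 0.1], [0.2, 0.8]]
C = [[0.8, 0.2], [0.3, 0.7]]
mu0 = [0.5, 0.5]

states, obs = sample_hmm(A, C, mu0, T=200, rng=random.Random(0))
preds = forward_filter_predictions(A, C, mu0, obs)
print("optimal-filter cross-entropy:", mean_cross_entropy(preds, obs))
```

A transformer trained on sequences from `sample_hmm` cannot, on average, achieve a lower cross-entropy than the filter, so the printed value serves as the reference loss for the comparison.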

Figures

Figures reproduced from arXiv: 2605.15608 by Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta.

Figure 1: Graphical model for (X, Z). For τ = 1, X is a Markov process and (X, Z) is a hidden Markov model (HMM). For τ > 1, X is a non-Markovian (or a τth-order Markov) process. For τ = T, X_T depends upon the entire past X_{0:T−1}. At convergence, the layer transformation yields the conditional probability of the hidden state at each time step, from which the weights U are computed via an explicit formula. For both …
Figure 2: Two-cycle HMM and a sample observation sequence. (left) State transition graph: state …
Figure 3: Attention and control weight heatmaps for …
Figure 4: Non-Markovian advantage. (top) Cross-entropy loss vs. …
Figure 5: Training and validation loss curves for the transformer model. The loss converges to the optimal filter loss in both …
Figure 6: Control and attention patterns shift under model perturbation. The model parameters are perturbed via a convex …
Figure 7: A sample observation trajectory and the corresponding control and attention patterns for the higher-dimensional …
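
The Figure 5 caption refers to the loss converging to the optimal filter loss. The standard identity behind that statement is written out below as a reading aid. The notation is an assumption made here for illustration, not taken from the paper: p denotes the true conditional law of the next observation and q any predictive model, such as a trained transformer.

```latex
% Assumed notation: p = true conditional law, q = any predictive model.
\mathbb{E}\bigl[-\log q(Z_{t+1}\mid Z_{1:t})\bigr]
  = \mathbb{E}\bigl[-\log p(Z_{t+1}\mid Z_{1:t})\bigr]
  + \mathbb{E}\Bigl[\mathrm{KL}\bigl(p(\cdot\mid Z_{1:t})\,\big\|\,q(\cdot\mid Z_{1:t})\bigr)\Bigr]
  \;\ge\; \mathbb{E}\bigl[-\log p(Z_{t+1}\mid Z_{1:t})\bigr].
```

Equality holds only when q matches p almost surely, so the expected cross-entropy of the optimal filter is a lower bound for the expected cross-entropy loss of any trained model.
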
Original abstract

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem - and in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that reformulating next-token prediction as an optimal control problem for a nonlinear discrete-valued process (motivated by transformers) and a linear Gaussian process yields an explicit 'dual filter' inference algorithm whose layered structure mirrors decoder-only transformers. Numerical experiments compare the resulting weights to attention weights from a trained transformer and conclude that transformers exploit non-Markovian structure when the embedding dimension is insufficient.

Significance. If the derivations are exact and the model classes representative, the work would provide a first-principles optimal-control derivation of transformer-like inference, a notable strength given the parameter-free character of the dual-filter construction. The numerical comparison in the insufficient-embedding regime offers limited empirical grounding but does not yet test reproduction of attention patterns on data generated from the paper's own nonlinear process.

major comments (3)
  1. [§3] §3 (nonlinear model derivation): the step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.
  2. [Numerical experiments] Numerical experiments section: the reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.
  3. [§4] §4 (linear Gaussian baseline): while the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.
minor comments (2)
  1. A diagram explicitly mapping dual-filter steps to transformer blocks (query/key/value, residual connections, etc.) would improve readability of the central claim.
  2. [Abstract] The abstract's phrasing that the dual filter 'mirrors' transformer layers should be qualified with the precise sense of mirroring (structural vs. functional) once the derivation is clarified.
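
Major comment 2 asks whether the dual-filter controls reproduce attention patterns. One possible way to turn a heatmap comparison into a number is a row-wise cosine similarity restricted to the causal positions. This metric is an assumption introduced here for illustration; it is not described in the paper, and the names `attn`, `ctrl`, and `causal_row_cosine` are hypothetical.

```python
import math


def causal_row_cosine(attn, ctrl):
    """Cosine similarity between matching rows of two T x T weight matrices.

    Row t is restricted to columns s <= t, the positions a decoder-only
    model can attend to. Returns one similarity per row; a row with zero
    norm in either matrix yields NaN.
    """
    sims = []
    for t, (row_a, row_c) in enumerate(zip(attn, ctrl)):
        a, c = row_a[: t + 1], row_c[: t + 1]
        dot = sum(x * y for x, y in zip(a, c))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_c = math.sqrt(sum(y * y for y in c))
        if norm_a > 0 and norm_c > 0:
            sims.append(dot / (norm_a * norm_c))
        else:
            sims.append(float("nan"))
    return sims


# Hypothetical 3 x 3 example: attention weights vs. control weights.
attn = [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.2, 0.3, 0.5]]
ctrl = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.1, 0.3, 0.6]]
print(causal_row_cosine(attn, ctrl))
```

Cosine similarity ignores overall scale, which matters if control weights are not normalized the way softmax attention rows are. A scale-sensitive alternative would be a row-wise distance after normalization.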

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (nonlinear model derivation): the step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.

    Authors: We agree that the transition from the optimal-control solution to the explicit dual-filter operations would be clearer with additional intermediate steps. In the revised manuscript we will insert the missing equations that derive the precise attention and feed-forward forms directly from the control inputs, confirming that no hidden approximations are required. revision: yes

  2. Referee: [Numerical experiments] Numerical experiments section: the reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.

    Authors: The experiments deliberately target the insufficient-embedding regime because that is where the non-Markovian behavior of trained transformers becomes visible. To strengthen the representativeness claim we will add a new set of experiments that generate sequences from the nonlinear discrete-valued process itself and directly compare the dual-filter controls against attention weights obtained from a transformer trained on those sequences. revision: yes

  3. Referee: [§4] §4 (linear Gaussian baseline): while the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.

    Authors: Section 4 presents the linear Gaussian model as an exactly solvable baseline. In the revision we will add a quantitative comparison (e.g., via linearization error bounds) between the linear dual filter and its nonlinear counterpart, together with an explicit statement of the conditions (small nonlinearity, sufficient embedding dimension) under which the layer structure remains identical to the transformer blocks. revision: yes
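
For the linear Gaussian model class discussed in response 3, the optimal one-step predictor is the classical Kalman filter. The scalar sketch below shows what "exactly solvable baseline" means in practice: the predictive mean and variance of the next observation follow from a closed-form recursion. This is the textbook Kalman recursion, not the paper's dual filter, and the model symbols `a`, `h`, `q`, `r` and the function name are assumptions made for illustration.

```python
def kalman_predictions(a, h, q, r, m0, p0, obs):
    """One-step-ahead predictions for a scalar linear Gaussian model.

    Assumed model (illustrative):
        x_{t+1} = a * x_t + w_t,  w_t ~ N(0, q)
        z_t     = h * x_t + v_t,  v_t ~ N(0, r)
    with x_0 ~ N(m0, p0). Returns, for each t, the mean and variance of
    z_t given z_0, ..., z_{t-1}.
    """
    m, p = m0, p0  # mean and variance of x_t given past observations
    preds = []
    for z in obs:
        z_mean = h * m
        z_var = h * h * p + r
        preds.append((z_mean, z_var))
        # Measurement update with the Kalman gain.
        gain = p * h / z_var
        m_post = m + gain * (z - z_mean)
        p_post = (1.0 - gain * h) * p
        # Time update through the linear dynamics.
        m = a * m_post
        p = a * a * p_post + q
    return preds


# Hypothetical parameters and a short observation sequence.
for mean, var in kalman_predictions(0.9, 1.0, 0.1, 0.5, 0.0, 1.0, [1.0, 0.8, 1.1]):
    print(f"predicted mean {mean:.4f}, variance {var:.4f}")
```

The predictive variance does not depend on the observed values, only on the model parameters, which is one reason the linear Gaussian case is tractable in a way the nonlinear discrete-valued case is not.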

Circularity Check

0 steps flagged

Derivation proceeds from assumed generative models via optimal control without reducing to fitted transformer outputs or self-citation chains

full rationale

The paper begins with two explicitly stated generative model classes (nonlinear discrete-valued process motivated by transformers, and linear Gaussian baseline), reformulates the next-token prediction objective as an optimal control problem, and derives the dual filter solution whose layer structure is shown to mirror decoder-only transformers. This chain is internal to the chosen models and optimal control theory; no step fits parameters to real transformer weights or invokes prior self-citations as load-bearing uniqueness results. The numerical experiments are presented as a separate comparison rather than part of the derivation itself. The representativeness of the model classes for actual data is an external validity question, not a circularity issue within the claimed first-principles derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of modeling the sequence process by either the nonlinear discrete-valued class or the linear Gaussian class and on the equivalence between the prediction objective and the optimal-control cost.

axioms (2)
  • domain assumption: Sequence data can be generated by a nonlinear discrete-valued process or a linear Gaussian process.
    The framework is developed separately for these two model classes.
  • domain assumption: The conditional-probability prediction objective can be exactly recast as a finite-horizon optimal-control problem.
    This recasting is the step that allows the dual filter to be derived.

pith-pipeline@v0.9.0 · 5676 in / 1281 out tokens · 33317 ms · 2026-05-20T21:20:41.718146+00:00 · methodology


