Transformer-like Inference from Optimal Control
Pith reviewed 2026-05-20 21:20 UTC · model grok-4.3
The pith
Reformulating next-token prediction as an optimal control problem produces a dual filter whose layers mirror those of decoder-only transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper derives inference architectures from first principles for the next-token prediction problem solved by decoder-only transformers. By reformulating the prediction objective as an optimal control problem in two model classes—a nonlinear discrete-valued process and a linear Gaussian process—the solution yields an explicit inference algorithm, the dual filter, whose layer structure mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control solution to attention weights from a trained transformer and show that insufficient embedding dimension leads the transformer to exploit non-Markovian structure.
What carries the argument
The dual filter, the explicit inference algorithm obtained by solving the optimal control reformulation of the next-token prediction objective.
Load-bearing premise
The chosen nonlinear discrete-valued process and linear Gaussian process must be representative of the data-generating mechanisms that transformers are trained on.
What would settle it
Generate sequences from the nonlinear discrete-valued process, train a decoder-only transformer on those sequences, and check whether the transformer's attention weights and layer computations align with the dual filter solution; a systematic mismatch would falsify the relevance of the optimal control derivation.
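The first step of this test, generating sequences and computing the exact next-token distribution that a trained transformer would be scored against, can be sketched as follows. The paper's nonlinear discrete-valued process is not specified on this page, so the sketch assumes a finite-state hidden Markov model as a stand-in; the names `sample_hmm`, `next_token_probs`, `A`, `C`, and `mu` are hypothetical.

```python
import numpy as np

def sample_hmm(A, C, mu, T, rng):
    """Sample T observations from an (assumed) hidden Markov model.

    A[i, j] = P(X_{t+1}=j | X_t=i), C[i, k] = P(Z_t=k | X_t=i), mu = P(X_0).
    """
    d, m = C.shape
    x = rng.choice(d, p=mu)
    z = np.empty(T, dtype=int)
    for t in range(T):
        z[t] = rng.choice(m, p=C[x])   # emit a token from the current state
        x = rng.choice(d, p=A[x])      # move to the next hidden state
    return z

def next_token_probs(A, C, mu, z):
    """Forward filter: row t is P(Z_t = . | z_0, ..., z_{t-1})."""
    pred = mu.copy()                   # P(X_0), before any observation
    out = [pred @ C]
    for obs in z:
        post = pred * C[:, obs]        # Bayes update with the observed token
        post /= post.sum()
        pred = post @ A                # predict the next hidden state
        out.append(pred @ C)           # predict the next token
    return np.array(out)

rng = np.random.default_rng(0)
d, m, T = 3, 4, 20                     # toy sizes, chosen for illustration only
A = rng.dirichlet(np.ones(d), size=d)
C = rng.dirichlet(np.ones(m), size=d)
mu = np.full(d, 1.0 / d)
z = sample_hmm(A, C, mu, T, rng)
p = next_token_probs(A, C, mu, z)
```

Each row of `p` is the exact conditional distribution of the next token under the assumed model, which is the quantity a transformer trained on such sequences is meant to approximate.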
Original abstract
Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem - and in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
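For the linear Gaussian baseline named in the abstract, the standard reference point for next-observation prediction is the Kalman filter. The sketch below is that classical filter, not the paper's dual filter, and the model form and the names `kalman_next_obs`, `Qw`, and `Rv` are assumptions made for illustration.

```python
import numpy as np

def kalman_next_obs(A, H, Qw, Rv, m0, P0, zs):
    """One-step-ahead predictions of the next observation.

    Assumed model: x[t+1] = A x[t] + w,  z[t] = H x[t] + v,
    with w ~ N(0, Qw), v ~ N(0, Rv) and prior x[0] ~ N(m0, P0).
    Returns E[z[t+1] | z[0..t]] for each t, plus the final state covariance.
    """
    m, P = m0.copy(), P0.copy()
    preds = []
    for z in zs:
        # Measurement update with the current observation.
        S = H @ P @ H.T + Rv
        K = np.linalg.solve(S, H @ P).T    # K = P H^T S^{-1} (S, P symmetric)
        m = m + K @ (z - H @ m)
        P = P - K @ S @ K.T
        # Time update: propagate mean and covariance one step ahead.
        m = A @ m
        P = A @ P @ A.T + Qw
        preds.append(H @ m)
    return np.array(preds), P

# Toy system, chosen for illustration only.
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
H = np.array([[1.0, 0.0]])
Qw = 0.1 * np.eye(2)
Rv = np.array([[0.5]])
x = rng.normal(size=2)
zs = []
for _ in range(50):
    zs.append(H @ x + rng.multivariate_normal(np.zeros(1), Rv))
    x = A @ x + rng.multivariate_normal(np.zeros(2), Qw)
preds, P_final = kalman_next_obs(A, H, Qw, Rv, np.zeros(2), np.eye(2), zs)
```

In this setting the conditional distribution of the next observation is Gaussian, so the predicted mean and covariance describe it completely.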
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reformulating next-token prediction as an optimal control problem for a nonlinear discrete-valued process (motivated by transformers) and a linear Gaussian process yields an explicit 'dual filter' inference algorithm whose layered structure mirrors decoder-only transformers. Numerical experiments compare the resulting weights to attention weights from a trained transformer and conclude that transformers exploit non-Markovian structure when the embedding dimension is insufficient.
Significance. If the derivations are exact and the model classes representative, the work would provide a first-principles optimal-control derivation of transformer-like inference. The parameter-free character of the dual-filter construction is a notable strength. The numerical comparison, confined to the insufficient-embedding regime, offers limited empirical grounding: it does not yet test whether the dual filter reproduces attention patterns on data generated from the paper's own nonlinear process.
major comments (3)
- [§3] (nonlinear model derivation): the step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.
- [Numerical experiments]: the reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.
- [§4] (linear Gaussian baseline): while the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.
minor comments (2)
- A diagram explicitly mapping dual-filter steps to transformer blocks (query/key/value, residual connections, etc.) would improve readability of the central claim.
- [Abstract] The abstract's phrasing that the dual filter 'mirrors' transformer layers should be qualified with the precise sense of mirroring (structural vs. functional) once the derivation is clarified.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [§3] (nonlinear model derivation): the step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.
  Authors: We agree that the transition from the optimal-control solution to the explicit dual-filter operations would be clearer with additional intermediate steps. In the revised manuscript we will insert the missing equations that derive the precise attention and feed-forward forms directly from the control inputs, confirming that no hidden approximations are required.
  Revision: yes
- Referee: [Numerical experiments]: the reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.
  Authors: The experiments deliberately target the insufficient-embedding regime because that is where the non-Markovian behavior of trained transformers becomes visible. To strengthen the representativeness claim we will add a new set of experiments that generate sequences from the nonlinear discrete-valued process itself and directly compare the dual-filter controls against attention weights obtained from a transformer trained on those sequences.
  Revision: yes
- Referee: [§4] (linear Gaussian baseline): while the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.
  Authors: Section 4 presents the linear Gaussian model as an exactly solvable baseline. In the revision we will add a quantitative comparison (e.g., via linearization error bounds) between the linear dual filter and its nonlinear counterpart, together with an explicit statement of the conditions (small nonlinearity, sufficient embedding dimension) under which the layer structure remains identical to the transformer blocks.
  Revision: yes
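The comparison between dual-filter controls and attention weights discussed in the second response needs a concrete score. One possible choice, sketched below, is the row-wise total variation distance between two causal weight matrices. This is an assumption, not the metric used in the paper: it presumes both sets of weights can be normalized so that each row is a probability distribution over past positions. The names `causal_softmax` and `rowwise_tv` are hypothetical, and random scores stand in for real attention logits and controls.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to positions s <= t (causal mask)."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    s = np.where(mask, scores, -np.inf)
    s = s - s.max(axis=1, keepdims=True)   # stabilize; the diagonal is never masked
    w = np.exp(s)
    return w / w.sum(axis=1, keepdims=True)

def rowwise_tv(W1, W2):
    """Total variation distance between matching rows of two row-stochastic matrices."""
    return 0.5 * np.abs(W1 - W2).sum(axis=1)

# Placeholder inputs: random scores in place of trained attention and computed controls.
rng = np.random.default_rng(2)
T = 6
attn = causal_softmax(rng.normal(size=(T, T)))
ctrl = causal_softmax(rng.normal(size=(T, T)))
tv = rowwise_tv(attn, ctrl)
```

A value of 0 in a row means the two weightings agree at that position and 1 means they place mass on disjoint sets of past tokens, so a systematic mismatch would show up as rows with distances near 1.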
Circularity Check
Derivation proceeds from assumed generative models via optimal control without reducing to fitted transformer outputs or self-citation chains
Full rationale
The paper begins with two explicitly stated generative model classes (nonlinear discrete-valued process motivated by transformers, and linear Gaussian baseline), reformulates the next-token prediction objective as an optimal control problem, and derives the dual filter solution whose layer structure is shown to mirror decoder-only transformers. This chain is internal to the chosen models and optimal control theory; no step fits parameters to real transformer weights or invokes prior self-citations as load-bearing uniqueness results. The numerical experiments are presented as a separate comparison rather than part of the derivation itself. The representativeness of the model classes for actual data is an external validity question, not a circularity issue within the claimed first-principles derivation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Sequence data can be generated by a nonlinear discrete-valued process or a linear Gaussian process.
- domain assumption: The conditional-probability prediction objective can be exactly recast as a finite-horizon optimal-control problem.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $J_T(u;f) := \tfrac{1}{2}\,|y_0|^2_{\Sigma_0} + \tfrac{1}{2}\sum \big(|y_{t+1}|^2_{Q} + |u_t|^2_{R}\big)$
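To make the quadratic objective quoted above concrete, the following sketch evaluates it for given trajectories. Several points are assumptions that the quoted passage does not state: the weighted norm is read as |v|²_M = vᵀ M v, the sum is taken over t = 0, ..., T-1, and the trajectories `y` and `u` are supplied directly rather than generated by the paper's dual control system. The function names are hypothetical.

```python
import numpy as np

def weighted_sq_norm(v, M):
    """Assumed convention: |v|^2_M = v^T M v."""
    return float(v @ M @ v)

def dual_cost(y, u, Sigma0, Q, R):
    """Evaluate 0.5*|y_0|^2_Sigma0 + 0.5*sum_t (|y_{t+1}|^2_Q + |u_t|^2_R).

    y has shape (T+1, d) and u has shape (T, m); the sum runs over
    t = 0..T-1, which is an assumption about the omitted bounds.
    """
    T = u.shape[0]
    cost = 0.5 * weighted_sq_norm(y[0], Sigma0)
    for t in range(T):
        cost += 0.5 * (weighted_sq_norm(y[t + 1], Q) + weighted_sq_norm(u[t], R))
    return cost

# Toy input with identity weights: y is all ones, u is all zeros.
y = np.ones((3, 2))
u = np.zeros((2, 1))
cost = dual_cost(y, u, np.eye(2), np.eye(2), np.eye(1))
print(cost)  # 3.0 = 0.5*2 + 0.5*(2 + 0) + 0.5*(2 + 0)
```

Minimizing such a cost over the control sequence `u`, subject to the dynamics that link `u` to `y`, is what makes the problem an optimal control problem; those dynamics are not reproduced on this page and are left out of the sketch.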
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.