Transformer-like Inference from Optimal Control

Aditya Kudre; Heng-Sheng Chang; Prashant G. Mehta

arXiv: 2605.15608 · v1 · pith:RMS3VONJnew · submitted 2026-05-15 · cs.LG · cs.SY · eess.SY

Pith reviewed 2026-05-20 21:20 UTC · model grok-4.3

classification cs.LG · cs.SY · eess.SY

keywords decoder-only transformers · optimal control · dual filter · next-token prediction · inference algorithm · attention weights · non-Markovian structure · sequence prediction

The pith

Reformulating next-token prediction as an optimal control problem produces a dual filter whose layers mirror those of decoder-only transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that a procedure for computing the conditional probability of the next token can be derived from first principles by casting the prediction task as an optimal control problem. For both a nonlinear model of discrete-valued processes and a linear Gaussian model, the solution to this control problem is an inference procedure called the dual filter. The dual filter has a layered structure that mirrors the architecture of a decoder-only transformer. A sympathetic reader cares because this offers a principled explanation for why transformers have the form they do: the layer structure emerges from the mathematics of optimal control rather than being assumed at the outset.

Core claim

The paper derives inference architectures from first principles for the next-token prediction problem solved by decoder-only transformers. The prediction objective is reformulated as an optimal control problem in two model classes (a nonlinear discrete-valued process and a linear Gaussian process), and its solution yields an explicit inference algorithm, the dual filter, whose layer structure mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control to attention weights from a trained transformer and indicate that, when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
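
To make "the conditional probability of the next token" concrete, the sketch below runs the classical forward filter for a hidden Markov model (HMM), the model class named in the Figure 1 caption. This is the textbook recursion, not the paper's dual filter. The transition matrix A, emission matrix C, prior mu0, the time-indexing convention, and the function name next_token_probs are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def next_token_probs(A, C, mu0, z):
    """Textbook HMM forward filter (not the paper's dual filter).

    Assumed convention: X_0 ~ mu0, A[i, j] = P(X_{t+1}=j | X_t=i),
    C[i, k] = P(Z_{t+1}=k | X_{t+1}=i).
    Row t of the result is P(Z_{t+1} = . | Z_1, ..., Z_t).
    """
    belief = np.asarray(mu0, dtype=float)   # P(X_t | Z_{1:t})
    out = []
    for obs in list(z) + [None]:
        pred_state = belief @ A             # P(X_{t+1} | Z_{1:t})
        out.append(pred_state @ C)          # P(Z_{t+1} | Z_{1:t})
        if obs is None:                     # no more observations to absorb
            break
        post = pred_state * C[:, obs]       # Bayes update with the new token
        belief = post / post.sum()
    return np.array(out)

# Hypothetical two-state, two-symbol model.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
C = np.array([[0.8, 0.2], [0.3, 0.7]])
mu0 = np.array([0.5, 0.5])
probs = next_token_probs(A, C, mu0, z=[0, 0, 1])
print(probs)  # one row of next-token probabilities per prefix length
```

Each row is a probability vector over the vocabulary, which is the quantity a decoder-only transformer outputs at the corresponding position.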

What carries the argument

The dual filter, the explicit inference algorithm obtained by solving the optimal control reformulation of the next-token prediction objective.

Load-bearing premise

The chosen nonlinear discrete-valued process and linear Gaussian process must be representative of the data-generating mechanisms that transformers are trained on.

What would settle it

Generate sequences from the nonlinear discrete-valued process, train a decoder-only transformer on those sequences, and check whether the transformer's attention weights and layer computations align with the dual filter solution; a systematic mismatch would falsify the relevance of the optimal control derivation.
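
The first step of such a test, generating training sequences from a known model, might be sketched as follows. Because the paper's nonlinear discrete-valued process is not specified on this page, the code samples from a small HMM as a stand-in. The matrices, the seed, and the function name sample_hmm are hypothetical.

```python
import numpy as np

def sample_hmm(A, C, mu0, T, rng):
    """Draw T hidden states and T observations from an HMM.

    Stand-in for the paper's process (assumption): X_0 ~ mu0,
    X_{t+1} ~ A[X_t], Z_{t+1} ~ C[X_{t+1}].
    """
    x = rng.choice(len(mu0), p=mu0)
    states, obs = [], []
    for _ in range(T):
        x = rng.choice(A.shape[1], p=A[x])   # hidden state transition
        z = rng.choice(C.shape[1], p=C[x])   # emitted token
        states.append(x)
        obs.append(z)
    return np.array(states), np.array(obs)

# Hypothetical two-state, two-symbol model and a fixed seed for repeatability.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
C = np.array([[0.8, 0.2], [0.3, 0.7]])
mu0 = np.array([0.5, 0.5])
states, obs = sample_hmm(A, C, mu0, T=20, rng=np.random.default_rng(0))
print(obs)
```

Sequences produced this way could serve as training data for a transformer, whose attention weights would then be compared against the control weights computed for the same model.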

Figures

Figures reproduced from arXiv: 2605.15608 by Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta.

Figure 1: Graphical model for (X, Z). For τ = 1, X is a Markov process and (X, Z) is a hidden Markov model (HMM). For τ > 1, X is a non-Markovian (or a τth-order Markov) process. For τ = T, X_T depends upon the entire past X_{0:T−1}. At convergence, the layer transformation yields the conditional probability of the hidden state at each time step, from which the weights U are computed via an explicit formula. For both …

Figure 2: Two-cycle HMM and a sample observation sequence. (left) State transition graph: state …

Figure 3: Attention and control weight heatmaps for …

Figure 4: Non-Markovian advantage. (top) Cross-entropy loss vs. …

Figure 5: Training and validation loss curves for the transformer model. The loss converges to the optimal filter loss in both …

Figure 6: Control and attention patterns shift under model perturbation. The model parameters are perturbed via a convex …

Figure 7: A sample observation trajectory and the corresponding control and attention patterns for the higher-dimensional …

Original abstract

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem and, in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.
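
For readers who want a reminder of what "attention weights" refers to in the comparison, here is a minimal sketch of standard single-head causal softmax attention weights. It illustrates the generic transformer operation only; it is not the paper's trained model, and the random Q and K matrices and the function name causal_attention_weights are assumptions made for the example.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Standard scaled dot-product attention weights with a causal mask.

    Q, K: arrays of shape (T, d). Row t of the result is a probability
    vector over positions 0..t; later positions receive zero weight.
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # positions s > t
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical queries and keys for a length-4 sequence.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
W = causal_attention_weights(Q, K)
print(np.round(W, 3))  # lower-triangular, each row sums to 1
```

The resulting matrix is lower-triangular and row-stochastic, which is the kind of object a heatmap of attention weights displays.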

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper derives decoder-only transformer layers as the explicit solution to an optimal control problem on two chosen generative models.

The core result is that you can recover the layer structure of a decoder-only transformer by solving an optimal control problem whose solution, the dual filter, performs next-token prediction. The authors do this for a nonlinear discrete-valued process that they motivate as transformer-like, and for a linear Gaussian process as a simpler case. The derivation starts from the prediction objective, casts it as a control problem, and produces inference steps that line up with attention and feed-forward blocks without presupposing the transformer equations.

Referee Report

3 major / 2 minor

Summary. The paper claims that reformulating next-token prediction as an optimal control problem for a nonlinear discrete-valued process (motivated by transformers) and a linear Gaussian process yields an explicit 'dual filter' inference algorithm whose layered structure mirrors decoder-only transformers. Numerical experiments compare the resulting weights to attention weights from a trained transformer and conclude that transformers exploit non-Markovian structure when the embedding dimension is insufficient.

Significance. If the derivations are exact and the model classes representative, the work would provide a first-principles optimal-control derivation of transformer-like inference, a notable strength given the parameter-free character of the dual-filter construction. The numerical comparison in the insufficient-embedding regime offers limited empirical grounding but does not yet test reproduction of attention patterns on data generated from the paper's own nonlinear process.

major comments (3)

§3 (nonlinear model derivation): The step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.
Numerical experiments: The reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.
§4 (linear Gaussian baseline): While the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.

minor comments (2)

A diagram explicitly mapping dual-filter steps to transformer blocks (query/key/value, residual connections, etc.) would improve readability of the central claim.
Abstract: The phrasing that the dual filter 'mirrors' transformer layers should be qualified with the precise sense of mirroring (structural vs. functional) once the derivation is clarified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: §3 (nonlinear model derivation): The step from the optimal-control solution to the explicit dual-filter layer operations that mirror decoder-only transformer blocks is stated as a direct consequence, yet the manuscript does not exhibit the intermediate equations showing how the control inputs produce the precise attention and feed-forward forms without hidden approximations or additional assumptions.

Authors: We agree that the transition from the optimal-control solution to the explicit dual-filter operations would be clearer with additional intermediate steps. In the revised manuscript we will insert the missing equations that derive the precise attention and feed-forward forms directly from the control inputs, confirming that no hidden approximations are required. (revision: yes)

Referee: Numerical experiments: The reported comparison is confined to the insufficient-embedding regime and does not include a test of whether the dual-filter controls reproduce attention patterns when data are drawn from the paper's nonlinear discrete-valued process, leaving the representativeness claim unverified.

Authors: The experiments deliberately target the insufficient-embedding regime because that is where the non-Markovian behavior of trained transformers becomes visible. To strengthen the representativeness claim we will add a new set of experiments that generate sequences from the nonlinear discrete-valued process itself and directly compare the dual-filter controls against attention weights obtained from a transformer trained on those sequences. (revision: yes)

Referee: §4 (linear Gaussian baseline): While the linear case is presented as tractable, the manuscript does not quantify how closely the resulting dual filter approximates the nonlinear case or under what conditions the mirroring to transformer layers remains structurally identical.

Authors: Section 4 presents the linear Gaussian model as an exactly solvable baseline. In the revision we will add a quantitative comparison (e.g., via linearization error bounds) between the linear dual filter and its nonlinear counterpart, together with an explicit statement of the conditions (small nonlinearity, sufficient embedding dimension) under which the layer structure remains identical to the transformer blocks. (revision: yes)
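
As background for why a linear Gaussian model counts as exactly solvable, the sketch below computes the one-step predictive distribution of the next observation with the classical Kalman filter. This is the standard recursion from the estimation literature, not the paper's dual filter, and the paper's linear Gaussian model may differ (for example in its memory structure). The state-space form, the noise covariances Qw and Rv, and the function name kalman_next_obs are assumptions.

```python
import numpy as np

def kalman_next_obs(A, H, Qw, Rv, m0, P0, zs):
    """Classical Kalman one-step predictor (not the paper's dual filter).

    Assumed model: x_{t+1} = A x_t + w_t,  z_{t+1} = H x_{t+1} + v_{t+1},
    with w_t ~ N(0, Qw), v_t ~ N(0, Rv), x_0 ~ N(m0, P0).
    Returns the mean and covariance of z_{t+1} given z_1..z_t for each t.
    """
    m, P = np.asarray(m0, float), np.asarray(P0, float)
    means, covs = [], []
    for z in list(zs) + [None]:
        m_pred = A @ m                          # predicted state mean
        P_pred = A @ P @ A.T + Qw               # predicted state covariance
        S = H @ P_pred @ H.T + Rv               # predicted observation covariance
        means.append(H @ m_pred)
        covs.append(S)
        if z is None:                           # no more observations to absorb
            break
        K = np.linalg.solve(S, H @ P_pred).T    # Kalman gain P_pred H^T S^{-1}
        m = m_pred + K @ (z - H @ m_pred)
        P = (np.eye(len(m)) - K @ H) @ P_pred
    return np.array(means), np.array(covs)

# Hypothetical scalar random walk observed in unit-variance noise.
one = np.array([[1.0]])
means, covs = kalman_next_obs(one, one, one, one, m0=[0.0], P0=one,
                              zs=np.array([[2.0]]))
print(means.ravel(), covs.ravel())
```

The predictive distribution is Gaussian and is fully described by these two outputs, which is what makes the linear case a convenient reference point.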

Circularity Check

0 steps flagged

Derivation proceeds from assumed generative models via optimal control without reducing to fitted transformer outputs or self-citation chains

full rationale

The paper begins with two explicitly stated generative model classes (nonlinear discrete-valued process motivated by transformers, and linear Gaussian baseline), reformulates the next-token prediction objective as an optimal control problem, and derives the dual filter solution whose layer structure is shown to mirror decoder-only transformers. This chain is internal to the chosen models and optimal control theory; no step fits parameters to real transformer weights or invokes prior self-citations as load-bearing uniqueness results. The numerical experiments are presented as a separate comparison rather than part of the derivation itself. The representativeness of the model classes for actual data is an external validity question, not a circularity issue within the claimed first-principles derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of modeling the sequence process by either the nonlinear discrete-valued class or the linear Gaussian class and on the equivalence between the prediction objective and the optimal-control cost.

axioms (2)

Domain assumption: Sequence data can be generated by a nonlinear discrete-valued process or a linear Gaussian process.
The framework is developed separately for these two model classes.

Domain assumption: The conditional-probability prediction objective can be exactly recast as a finite-horizon optimal-control problem.
This recasting is the step that allows the dual filter to be derived.

pith-pipeline@v0.9.0 · 5676 in / 1281 out tokens · 33317 ms · 2026-05-20T21:20:41.718146+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: $J_T(u;f) := \tfrac{1}{2}\,|y_0|^2_{\Sigma_0} + \tfrac{1}{2}\sum_t \left( |y_{t+1}|^2_Q + |u_t|^2_R \right)$
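
The quadratic cost quoted above can be evaluated numerically once its notation is pinned down. The sketch below assumes the common weighted-norm convention |v|²_M = vᵀ M v and a sum over t = 0, …, T−1; neither is stated on this page, so both are labeled as assumptions, as are the function name quadratic_cost and the sample values. How y depends on u and f is defined in the paper and is not modeled here.

```python
import numpy as np

def quadratic_cost(y, u, Sigma0, Q, R):
    """Evaluate J = 1/2 |y_0|^2_{Sigma0} + 1/2 sum_t (|y_{t+1}|^2_Q + |u_t|^2_R).

    Assumptions: |v|^2_M means v^T M v, y has shape (T+1, n), u has shape
    (T, m), and the sum runs over t = 0..T-1.
    """
    cost = 0.5 * y[0] @ Sigma0 @ y[0]
    for t in range(len(u)):
        cost += 0.5 * (y[t + 1] @ Q @ y[t + 1] + u[t] @ R @ u[t])
    return float(cost)

# Hypothetical trajectory with T = 2, n = 2, m = 1.
y = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
u = np.array([[1.0], [2.0]])
J = quadratic_cost(y, u, Sigma0=np.eye(2), Q=np.eye(2), R=np.array([[2.0]]))
print(J)  # 8.5
```

With identity weights on y and a weight of 2 on u, the terms are 0.5, 3.0, and 5.0, which sum to 8.5.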

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1] Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238, 2022.

[2] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023.

[3] Borjan Geshkovski, Philippe Rigollet, and Domènec Ruiz-Balet. Measure-to-measure interpolation using transformers. arXiv preprint arXiv:2411.04551, 2024.

[4] Álvaro Rodríguez Abella, João Pedro Silvestre, and Paulo Tabuada. The asymptotic behavior of attention in transformers. arXiv preprint arXiv:2412.02682, 2024.

[5] Daniel Owusu Adu and Bahman Gharesifard. Approximate controllability of continuity equation of transformers. IEEE Control Systems Letters, 2024.

[6] Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A unified perspective on the dynamics of deep transformers. arXiv preprint arXiv:2501.18322, 2025.

[7] Kağan Akman, Naci Saldı, and Serdar Yüksel. An optimal control approach to transformer training. arXiv preprint arXiv:2603.09571, 2026.

[8] Kelvin Kan, Xingjian Li, Benjamin Zhang, Tuhin Sahai, Stanley Osher, and Markos Katsoulakis. Optimal control for transformer architectures: Enhancing generalization, robustness and efficiency. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[9] Henri Cimetière, Maria Teresa Chiri, and Bahman Gharesifard. Localmax dynamics for attention in transformers and its asymptotic behavior. arXiv preprint arXiv:2509.15958, 2025.

[10] Vikram Krishnamurthy. LLMs as high-dimensional nonlinear autoregressive models with attention: Training, alignment and inference. arXiv preprint arXiv:2602.00426, 2026.

[11] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080, 2021. (internal anchor)

[12] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 36:57125–57211, 2023.

[13] Raanan Y Rohekar, Yaniv Gurwicz, and Shami Nisimov. Causal interpretation of self-attention in pre-trained transformers. Advances in Neural Information Processing Systems, 36:31450–31465, 2023.

[14] Ahmed M Alaa and Mihaela van der Schaar. Attentive state-space modeling of disease progression. Advances in Neural Information Processing Systems, 32, 2019.

[15] Binh Tang and David S Matteson. Probabilistic transformer for time series analysis. Advances in Neural Information Processing Systems, 34:23592–23608, 2021.

[16] Gautam Goel and Peter Bartlett. Can a transformer represent a Kalman filter? In 6th Annual Learning for Dynamics & Control Conference, pages 1502–1512. PMLR, 2024.

[17] Zhe Du, Haldun Balim, Samet Oymak, and Necmiye Ozay. Can transformers learn optimal filtering for unknown systems? IEEE Control Systems Letters, 7:3525–3530, 2023.

[18] Heng-Sheng Chang and Prashant G Mehta. Dual filter: A mathematical framework for inference using transformer-like architectures. arXiv preprint arXiv:2505.00818, 2025.

[19] Yaakov Bar-Shalom, X Rong Li, and Thiagalingam Kirubarajan. Estimation with applications to tracking and navigation: theory algorithms and software. John Wiley & Sons, 2001.

[20] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. (internal anchor)

[21] Jin Won Kim, Prashant G. Mehta, and Sean Meyn. What is the Lagrangian for nonlinear filtering? In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 1607–1614, Nice, France, 12 2019. IEEE.

[22] Jin Won Kim and Prashant G Mehta. Duality for nonlinear filtering II: Optimal control. IEEE Transactions on Automatic Control, 69(2):712–725, 2023.

[23] Alain Bensoussan. Estimation and control of dynamical systems, volume 48. Springer, 2018.

[24] Thomas Kailath, Ali H Sayed, and Babak Hassibi. Linear estimation. Prentice Hall, 2000.

[25] Emanuel Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pages 4286–4292. IEEE, 2008.

[26] Jin Won Kim and Prashant G Mehta. The arrow of time in estimation and control: Duality theory beyond the linear Gaussian model. IEEE Control Systems, 45(2):70–90, 2025.

[27] Andrej Karpathy. nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs. https://github.com/karpathy/nanoGPT, 2024.

[28] Reginald Zhiyan Chen, Heng-Sheng Chang, and Prashant G Mehta. Differentiable filtering for learning hidden Markov models. In 8th Annual Learning for Dynamics and Control Conference, 2026.

[29] Assume P([Z=z])≥c T >0 for all z∈O T , and the existence of a unique U follows directly from the earlier result

[30] Adopt the convention 0 0 = 0 to define (or extend) the conditional expectation for sample paths Z=z with P([Z=z]) = 0 . Then again, a particular selection of U follows from the above result. In the second case, however, there may be other choices of U such that the representation (4) holds: Any two choices will yield a representation that coincides on the...
