Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics
Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3
The pith
In switching recurrent models for chaotic dynamics, teacher forcing inflates the curvature of the training objective compared to the marginal likelihood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Identity teacher forcing corresponds to a generalized Bayes update conditioned on a single forced regime path, which inflates the observed information (curvature) relative to the marginal likelihood. The marginal likelihood curvature is reduced by a missing-information correction that accounts for regime ambiguity, estimated via Louis' identity in a probabilistic switching extension of AL-RNNs. This leads to the observation that windowed evidence fine-tuning can improve held-out evidence at the potential cost of dynamical quantities of interest (QoIs).
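For orientation, the missing-information correction invoked here is the covariance term in Louis' identity, written below in generic notation (x the observations, z the latent regime path, θ the parameters); the paper's specific complete-data likelihood for the switching AL-RNN is not reproduced.

```latex
% Louis' identity in generic missing-data notation: x = observations,
% z = latent regime path, theta = parameters. The covariance term is the
% missing-information correction that reduces the marginal curvature.
\[
  I_{\mathrm{obs}}(\theta)
  \;=\; -\nabla_\theta^2 \log p(x \mid \theta)
  \;=\; \mathbb{E}_{z \mid x,\theta}\!\left[-\nabla_\theta^2 \log p(x, z \mid \theta)\right]
  \;-\; \operatorname{Cov}_{z \mid x,\theta}\!\left[\nabla_\theta \log p(x, z \mid \theta)\right].
\]
```

On this reading, conditioning on a single forced regime path keeps only a complete-data term evaluated at that path and drops the positive semi-definite covariance correction, which is the sense in which ITF inflates curvature whenever several regime explanations remain plausible.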
What carries the argument
The comparison of optimization curvatures between identity teacher forcing and marginal likelihood, computed using Louis' identity on a probabilistic switching augmentation of almost-linear RNNs.
Load-bearing premise
That the probabilistic switching augmentation accurately captures the ambiguity in regimes, and that Louis' identity provides an unbiased estimate of the observed information matrix without additional assumptions on the switching process.
What would settle it
If the estimated curvature difference between ITF and marginal likelihood disappears in experiments where regime paths are known to be unique and unambiguous, or if Louis' identity fails to match direct Hessian computations in small models.
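The second settling route, checking Louis' identity against a direct Hessian in a small model, can be prototyped in a few lines. The sketch below uses a two-component Gaussian mixture with latent component indicators standing in for the missing regime path; the mixture form, parameter values, and sample size are illustrative assumptions, not settings from the paper.

```python
# Toy check: Louis' identity should reproduce the observed information of the
# marginal likelihood, here verified against a finite-difference Hessian.
import numpy as np

rng = np.random.default_rng(0)
p_regime, theta_true, n = 0.5, 1.5, 400      # mixing weight, true shift, sample size
z = rng.random(n) < p_regime                 # latent regime indicators ("missing" data)
x = rng.normal(np.where(z, theta_true, 0.0), 1.0)

def marginal_loglik(theta):
    """Log-likelihood with the regime indicators summed out."""
    comp0 = (1 - p_regime) * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    comp1 = p_regime * np.exp(-0.5 * (x - theta)**2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(comp0 + comp1))

def louis_information(theta):
    """Observed information via Louis' identity:
    E[complete-data information | x] - Var[complete-data score | x]."""
    comp0 = (1 - p_regime) * np.exp(-0.5 * x**2)
    comp1 = p_regime * np.exp(-0.5 * (x - theta)**2)
    w = comp1 / (comp0 + comp1)              # posterior regime probabilities
    complete_info = np.sum(w)                # E[z_i | x], since -d2 log p(x_i, z_i)/dtheta2 = z_i
    missing_info = np.sum((x - theta)**2 * w * (1 - w))   # Var[z_i (x_i - theta) | x]
    return complete_info - missing_info

theta, eps = 1.3, 1e-3                        # evaluate away from the truth, as during training
direct = -(marginal_loglik(theta + eps) - 2 * marginal_loglik(theta)
           + marginal_loglik(theta - eps)) / eps**2

print(f"finite-difference curvature of log-evidence: {direct:.3f}")
print(f"Louis' identity estimate:                    {louis_information(theta):.3f}")
# The two values should agree up to finite-difference error; a persistent gap in
# such small models would undercut the curvature comparison.
```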
Original abstract
Identity teacher forcing (ITF) enables stable training of deterministic recurrent surrogates for chaotic dynamical systems and has been highly effective for dynamical systems reconstruction (DSR) with recurrent neural networks (RNNs), including interpretable almost-linear RNNs (AL-RNNs). However, as an intervention-based prediction loss (and thus a generalized Bayes update), teacher forcing need not match the free-running model's marginal likelihood geometry. We compare the objective-induced curvatures of ITF and marginal likelihood in a probabilistic switching augmentation of AL-RNNs, estimating ambiguity-aware observed information via Louis' identity. In the switching setting studied here, conditioning on a single forced regime path (as ITF does) inflates curvature, while marginal likelihood curvature is reduced by a missing-information correction when multiple switching explanations remain plausible. In Lorenz-63 experiments, windowed evidence fine-tuning improves held-out evidence but can degrade dynamical quantities of interest (QoIs) relative to ITF-pretrained models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that identity teacher forcing (ITF) acts as a generalized Bayes update whose induced curvature on probabilistic switching augmentations of AL-RNNs differs from that of the marginal likelihood; specifically, conditioning on a single forced regime path inflates curvature while the marginal likelihood receives a missing-information correction (via Louis' identity) when multiple regime explanations remain plausible. This geometry mismatch is analyzed theoretically and demonstrated empirically on Lorenz-63, where windowed evidence fine-tuning after ITF pretraining improves held-out evidence yet can degrade dynamical quantities of interest relative to the ITF-pretrained baseline.
Significance. If the central comparison holds, the work supplies a concrete, quantifiable account of why intervention-based losses like teacher forcing succeed for chaotic DSR even though they are not marginal-likelihood estimators. The application of Louis' identity to obtain ambiguity-aware observed information in a switching RNN setting is a reusable technical contribution, and the reported trade-off between evidence and dynamical QoIs on Lorenz-63 supplies falsifiable guidance for hybrid training pipelines.
major comments (2)
- [§3.2] §3.2, application of Louis' identity: the derivation treats the switching indicators as missing data whose conditional expectations are computed under the ITF-forced path; it is not shown whether this expectation is taken with respect to the same posterior that defines the marginal likelihood, which is required for the curvature comparison to be internally consistent.
- [§4.3] §4.3, Lorenz-63 results: the claim that fine-tuning 'can degrade' dynamical QoIs is supported only by point estimates on a single trajectory; without reported variance across random seeds or multiple initial conditions, it is impossible to judge whether the observed degradation is systematic or an artifact of the particular windowing schedule.
minor comments (2)
- Notation: the symbol for the observed information matrix is introduced without an explicit definition linking it to the Hessian of the log-marginal likelihood; a one-line reminder of the relation I_obs = -H, with H the Hessian of the log-marginal likelihood, would improve readability.
- Figure 2 caption: the phrase 'curvature inflation' is used without stating the numerical factor by which the ITF curvature exceeds the marginal curvature on the plotted example; adding the ratio would make the visual comparison quantitative.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We respond to each major comment below.
Point-by-point responses
Referee: [§3.2] §3.2, application of Louis' identity: the derivation treats the switching indicators as missing data whose conditional expectations are computed under the ITF-forced path; it is not shown whether this expectation is taken with respect to the same posterior that defines the marginal likelihood, which is required for the curvature comparison to be internally consistent.
Authors: In §3.2 the Louis identity is applied to the marginal likelihood of the switching model, so the conditional expectations over regime indicators are taken with respect to the posterior p(regimes | data, θ) that defines that marginal likelihood. The ITF objective is treated separately as a conditioning on a single regime path; its curvature is compared to the marginal curvature obtained via Louis. We will add an explicit sentence in the revised manuscript stating that the expectations for the marginal observed information are under the marginal posterior, making the internal consistency of the comparison clear. (Revision: partial)
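A toy rendering of the distinction the response draws, reusing the mixture setup from the earlier sketch: curvature with regime expectations taken under the posterior p(z | x, θ), versus curvature when a single hard regime path is conditioned on as an ITF-like stand-in. All modelling choices below are illustrative assumptions, not the paper's switching AL-RNN.

```python
# Marginal (Louis) curvature vs. curvature under a single forced regime path.
import numpy as np

rng = np.random.default_rng(1)
p_regime, theta_true, n = 0.5, 1.5, 2000
z = rng.random(n) < p_regime
x = rng.normal(np.where(z, theta_true, 0.0), 1.0)

theta = 1.3
comp0 = (1 - p_regime) * np.exp(-0.5 * x**2)
comp1 = p_regime * np.exp(-0.5 * (x - theta)**2)
w = comp1 / (comp0 + comp1)                  # posterior regime probabilities p(z_i = 1 | x_i, theta)

# Ambiguity-aware curvature of the marginal likelihood via Louis' identity.
info_marginal = np.sum(w) - np.sum((x - theta)**2 * w * (1 - w))

# Curvature when a single forced regime path is conditioned on
# (here the posterior-mode path, standing in for the ITF-forced path).
z_forced = (w > 0.5).astype(float)
info_forced = np.sum(z_forced)

print(f"curvature, marginal likelihood (Louis): {info_marginal:.1f}")
print(f"curvature, single forced regime path:   {info_forced:.1f}")
# With diffuse regime posteriors the forced-path curvature tends to come out
# larger in this toy, mirroring the inflation attributed to ITF; the size of
# the gap depends on how ambiguous the posteriors are.
```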
Referee: [§4.3] §4.3, Lorenz-63 results: the claim that fine-tuning 'can degrade' dynamical QoIs is supported only by point estimates on a single trajectory; without reported variance across random seeds or multiple initial conditions, it is impossible to judge whether the observed degradation is systematic or an artifact of the particular windowing schedule.
Authors: The manuscript states that fine-tuning 'can degrade' dynamical QoIs, which is supported by the concrete example shown. We agree that variance across seeds and initial conditions would allow readers to assess how frequently the trade-off appears. In the revision we will add results from multiple random seeds and initial conditions, reporting means and standard deviations of the QoI changes under both the ITF baseline and the evidence-fine-tuned models. (Revision: yes)
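The promised reporting protocol can be sketched directly: evaluate a dynamical QoI over several initial conditions (and, for trained models, several seeds) and report mean and standard deviation rather than a single point estimate. The integrator, step size, run count, and placeholder QoI below are illustrative assumptions; only the Lorenz-63 parameters σ = 10, ρ = 28, β = 8/3 are the standard chaotic setting.

```python
# Protocol sketch: a QoI evaluated over multiple Lorenz-63 runs, reported as mean +/- std.
import numpy as np

def lorenz63_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def simulate(x0, dt=0.01, n_steps=10_000):
    """Integrate Lorenz-63 with a fixed-step RK4 scheme."""
    traj = np.empty((n_steps, 3))
    state = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        k1 = lorenz63_rhs(state)
        k2 = lorenz63_rhs(state + 0.5 * dt * k1)
        k3 = lorenz63_rhs(state + 0.5 * dt * k2)
        k4 = lorenz63_rhs(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[t] = state
    return traj

def qoi(traj, burn_in=1_000):
    """Placeholder QoI: long-run average of the third coordinate (not one of the paper's metrics)."""
    return traj[burn_in:, 2].mean()

rng = np.random.default_rng(0)
values = np.array([qoi(simulate(rng.normal([1.0, 1.0, 25.0], 1.0))) for _ in range(10)])
print(f"QoI: {values.mean():.3f} +/- {values.std(ddof=1):.3f} over {values.size} runs")
```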
Circularity Check
No significant circularity detected
Full rationale
The paper contrasts ITF-induced curvature inflation against marginal-likelihood curvature reduction via Louis' identity in a probabilistic switching AL-RNN model. Louis' identity is an external, standard missing-data result (not derived or fitted inside the paper). The geometry-mismatch claim rests on explicit modeling choices and Lorenz-63 experiments rather than any self-definitional loop, fitted-input-renamed-as-prediction, or load-bearing self-citation chain. No equation reduces to its own inputs by construction, and the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard mathematics: Louis' identity for computing observed information in the presence of missing regime-path data