The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
Pith reviewed 2026-05-08 16:23 UTC · model grok-4.3
The pith
When environment modes are slower or less noisy than system modes, every minimizer of the population prediction risk encodes the environment rather than the system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a predictive-causal gap exists as a structural property of the predictive objective: when environment modes are slower or less noisy than system modes, every minimizer of the population risk encodes the former. This is shown by decomposing linear-Gaussian dynamics into separable modes, proving that the optimal encoder allocates sensitivity away from system degrees of freedom, and confirming that the result holds across an open, positive-measure set of parameters. Empirical sweeps of 2695 configurations and nonlinear Duffing-GRU tasks demonstrate low causal fidelity; operational grounding that restricts the loss to system observables reduces but does not eliminate the gap.
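To make the mechanism concrete, here is a minimal numerical sketch (not the paper's code): two independent AR(1) modes, a fast, noisy "system" mode and a slow, quiet "environment" mode. The best rank-1 one-step predictor, fit by reduced-rank regression, allocates essentially all encoder sensitivity to the environment coordinate. All parameter values are illustrative assumptions.

```python
# Minimal sketch of the linear-Gaussian mechanism (illustrative, not the
# paper's code). Two independent AR(1) modes: a fast, noisy "system" mode
# and a slow, quiet "environment" mode. We fit the best rank-1 one-step
# predictor by reduced-rank regression and measure how much of the
# encoder's sensitivity lands on the system coordinate.
import numpy as np

rng = np.random.default_rng(0)
a_sys, q_sys = 0.30, 1.0   # fast, noisy system mode (assumed values)
a_env, q_env = 0.99, 0.1   # slow, quiet environment mode (assumed values)
A = np.diag([a_sys, a_env])

T = 100_000
w = rng.normal(size=(T, 2)) * np.sqrt([q_sys, q_env])
x = np.zeros((T, 2))
for t in range(T - 1):
    x[t + 1] = A @ x[t] + w[t]

X, Y = x[1000:-1], x[1001:]          # drop burn-in, align (x_t, x_{t+1})
Sxx = X.T @ X / len(X)
Syx = Y.T @ X / len(X)

# Reduced-rank regression: whiten x, keep the top singular direction.
L = np.linalg.cholesky(Sxx)
Linv = np.linalg.inv(L)
U, s, Vt = np.linalg.svd(Syx @ Linv.T)
encoder = Linv.T @ Vt[0]             # optimal rank-1 encoder row on x

fidelity = encoder[0] ** 2 / (encoder @ encoder)
print(f"causal fidelity of optimal rank-1 encoder: {fidelity:.2e}")
for name, a, q in [("sys", a_sys, q_sys), ("env", a_env, q_env)]:
    print(f"{name}: predictable variance a^2 q/(1-a^2) = {a*a*q/(1-a*a):.3f}")
```

The per-mode score a^2 q / (1 - a^2) is the variance a one-step predictor can remove by keeping that coordinate; with these assumed parameters the environment mode's score is roughly fifty times the system mode's, so the risk minimizer ignores the system almost entirely.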
What carries the argument
population risk minimization over linear-Gaussian dynamics with separable system and environment modes of differing speeds and noise levels
If this is right
- At dimension 100 the optimal encoder becomes causally blind while still achieving 92 percent lower prediction error than a causal representation.
- The set of dynamics that produce the predictive-causal gap forms an open set of positive measure in parameter space.
- Operational grounding that restricts the loss to system observables lowers environment dominance but never restores full causal fidelity without an explicit boundary.
- In nonlinear Duffing-GRU sweeps, unconstrained predictors learn environment-dominant representations in 55 percent of tasks and suffer 1.82 times higher out-of-distribution MSE under environment shifts.
Where Pith is reading between the lines
- Self-supervised world models may systematically fail to capture the causal structure of the intended system when trained on raw predictive objectives.
- Scaling predictive models without enforcing mode separation could increase out-of-distribution fragility in environments with mixed timescales.
- Hybrid objectives that combine prediction with explicit system-environment constraints may be required to close the gap.
Load-bearing premise
The dynamics can be cleanly decomposed into separable system and environment modes with distinct temporal and noise characteristics.
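One plausible formalization of this premise (notation ours, not necessarily the paper's) is a block-diagonal linear-Gaussian system:

```latex
x_{t+1} = A\,x_t + w_t,\qquad
A = \begin{pmatrix} A_{\mathrm{sys}} & 0 \\ 0 & A_{\mathrm{env}} \end{pmatrix},\qquad
w_t \sim \mathcal{N}\!\left(0,\ \mathrm{diag}(\Sigma_{\mathrm{sys}},\,\Sigma_{\mathrm{env}})\right)
```

with the environment block slower (spectral radius closer to 1) and/or less noisy (smaller noise covariance) than the system block.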
What would settle it
A concrete counterexample in the linear-Gaussian case where a minimizer of the population risk achieves high causal fidelity while environment modes remain slower or less noisy would disprove the theorem.
Original abstract
We report a systematic failure mode in predictive representation learning. Across 2695 neural network configurations trained to predict linear-Gaussian dynamics, the optimal encoder tracks the environment rather than the system it is meant to model. The mean causal fidelity -- the fraction of encoder sensitivity allocated to system degrees of freedom -- is 0.49, and only 2.5% of configurations exceed 0.70. The failure intensifies with dimension: at N=100, the optimal encoder becomes causally blind (fidelity ~10^{-8}) while achieving 92% lower prediction error than the causal representation. We prove this is not an optimization artifact but a structural property of the predictive objective: when environment modes are slower or less noisy than system modes, every minimizer of the population risk encodes the former. The set of dynamics exhibiting this predictive-causal gap is open and of positive measure in parameter space. In a nonlinear Duffing-GRU sweep, unconstrained predictors learn environment-dominant representations in 55% of tasks (95% CI 41--68%) versus 24% under operational grounding (p=2.3e-3); the median out-of-distribution MSE inflation under environment shift is 1.82x versus 1.00x. Operational grounding -- restricting the loss to system observables -- partially suppresses the gap, but causal fidelity is never recovered without an explicit system-environment boundary. The results identify the predictive-causal gap as a structural limit of learning, with implications for self-supervised representation learning, world models, and the scaling paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to identify a 'predictive-causal gap' as a structural property of predictive objectives in dynamical systems: when environment modes are slower or less noisy than system modes, every population-risk minimizer encodes the former rather than the system. This is formalized via an impossibility theorem for linear-Gaussian dynamics (showing the gap set is open and of positive measure in parameter space) and supported by large-scale experiments (2695 neural configurations on linear-Gaussian systems yielding mean causal fidelity 0.49, dropping to ~10^{-8} at N=100) plus nonlinear Duffing-GRU sweeps (environment-dominant representations in 55% of tasks, mitigated but not eliminated by operational grounding).
Significance. If the central claims hold, the work identifies a fundamental limitation of pure predictive representation learning with direct implications for world models, self-supervised learning, and scaling paradigms. Strengths include the explicit theorem for the linear-Gaussian case, the scale of the empirical sweep, and the introduction of operational grounding as a partial mitigation; these elements provide both theoretical grounding and falsifiable predictions that could guide future algorithm design.
major comments (3)
- [Theorem statement and proof] Theorem on linear-Gaussian case (likely §3 or §4): the proof that every population-risk minimizer encodes slower/less-noisy environment modes relies on diagonalizability and eigenvalue/noise ordering. The manuscript should explicitly derive the allocation of encoder sensitivity (e.g., via the closed-form minimizer or Lagrangian) and confirm that the open-set/positive-measure property does not collapse under small perturbations to the mode separation assumption.
- [Nonlinear experiments and discussion] Nonlinear Duffing-GRU experiments: the report of environment-dominant representations in 55% of tasks (95% CI 41-68%) is presented as evidence that the gap is not limited to linear-Gaussian regimes. However, without an eigendecomposition or equivalent structural decomposition, it is unclear whether these results arise from the same population-risk argument or from GRU inductive biases/optimization landscape. This distinction is load-bearing for the claim that the gap is a general 'structural limit of learning.'
- [Empirical results on linear-Gaussian systems] High-dimensional linear results (N=100 case): the claim of 92% lower prediction error for the optimal encoder versus the causal representation requires a precise definition of the causal baseline encoder and the exact error metric (in-sample vs. out-of-distribution). Without this, the comparison risks conflating predictive performance with the causal-fidelity metric.
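On the third comment, the causal baseline and the comparison can be made concrete in the diagonal toy model sketched earlier. The following hedged sketch uses the analytic one-step MSE of a rank-1 encoder that keeps exactly one coordinate; the numbers are illustrative and are not claimed to reproduce the paper's 92% figure.

```python
# Hedged sketch comparing the causal baseline (keep only the system
# coordinate) with the environment-aligned encoder in the diagonal
# two-mode model used above. One-step MSE of a rank-1 encoder that keeps
# coordinate i, with an optimal linear decoder, is
#   sum_j (a_j^2 P_j + q_j) - a_i^2 P_i,   where P_j = q_j / (1 - a_j^2).
modes = {"sys": (0.30, 1.0), "env": (0.99, 0.1)}   # assumed (a, q) values
P = {k: q / (1 - a * a) for k, (a, q) in modes.items()}
total = sum(a * a * P[k] + q for k, (a, q) in modes.items())

mse = {k: total - a * a * P[k] for k, (a, q) in modes.items()}
reduction = (mse["sys"] - mse["env"]) / mse["sys"]
print(f"one-step MSE, causal baseline (keep sys): {mse['sys']:.3f}")
print(f"one-step MSE, env-aligned encoder:        {mse['env']:.3f}")
print(f"relative error reduction:                 {reduction:.1%}")
```

Under these assumed parameters the environment-aligned encoder cuts the one-step error by roughly 80%, illustrating how predictive performance and causal fidelity can pull in opposite directions.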
minor comments (2)
- [Abstract] Abstract and methods: the p-value (p=2.3e-3) for the grounding comparison should be accompanied by the exact statistical test used and sample size to allow independent verification.
- [Definitions] Notation: 'causal fidelity' is defined as the fraction of encoder sensitivity allocated to system degrees of freedom; provide the precise formula (e.g., projection onto system eigenvectors) in the main text rather than the appendix.
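One plausible formalization consistent with the abstract's wording (the paper's exact definition may differ) is the fraction of the encoder Jacobian's squared Frobenius norm captured by the projection onto the system eigenvectors:

```latex
\mathrm{fid}(\phi) \;=\; \frac{\lVert J_\phi\,\Pi_{\mathrm{sys}} \rVert_F^{2}}{\lVert J_\phi \rVert_F^{2}},
\qquad J_\phi = \frac{\partial \phi(x)}{\partial x}
```

where Pi_sys projects onto the system eigenvectors; fid = 1 means all encoder sensitivity is allocated to system modes.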
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate revisions where we agree changes are warranted.
Point-by-point responses
Referee: Theorem on linear-Gaussian case (likely §3 or §4): the proof that every population-risk minimizer encodes slower/less-noisy environment modes relies on diagonalizability and eigenvalue/noise ordering. The manuscript should explicitly derive the allocation of encoder sensitivity (e.g., via the closed-form minimizer or Lagrangian) and confirm that the open-set/positive-measure property does not collapse under small perturbations to the mode separation assumption.
Authors: We agree that an explicit derivation will strengthen the presentation. In the revised manuscript we will include the closed-form solution for the optimal encoder obtained by minimizing the population risk under the linear-Gaussian assumption. This derivation proceeds via the Lagrangian of the constrained least-squares problem and shows that encoder sensitivity is allocated proportionally to the inverse of the mode noise variances and inversely to the eigenvalue magnitudes. The set of parameters exhibiting the gap is defined by strict inequalities on eigenvalue and noise ordering; because these inequalities define an open set in parameter space, the positive-measure property is preserved under sufficiently small perturbations that maintain the ordering. revision: yes
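A quick numerical illustration of the open-set claim, using the illustrative two-mode parameters from the earlier sketches (perturbation sizes and values are our assumptions, not the paper's):

```python
# Hedged check: perturb (a, q) for both modes and verify that the
# environment mode's predictable variance a^2 q / (1 - a^2) still
# dominates, so the ordering (and hence the gap) survives small
# perturbations, consistent with an open set in parameter space.
import numpy as np

rng = np.random.default_rng(1)

def score(a, q):
    # variance removable by one-step prediction when keeping this mode
    return a * a * q / (1.0 - a * a)

base = {"sys": (0.30, 1.0), "env": (0.99, 0.1)}   # assumed (a, q) values
eps, trials, wins = 0.005, 10_000, 0
for _ in range(trials):
    pert = {k: (a + rng.uniform(-eps, eps), q * (1 + rng.uniform(-eps, eps)))
            for k, (a, q) in base.items()}
    wins += score(*pert["env"]) > score(*pert["sys"])
print(f"environment mode dominates in {wins}/{trials} perturbed systems")
```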
Referee: Nonlinear Duffing-GRU experiments: the report of environment-dominant representations in 55% of tasks (95% CI 41-68%) is presented as evidence that the gap is not limited to linear-Gaussian regimes. However, without an eigendecomposition or equivalent structural decomposition, it is unclear whether these results arise from the same population-risk argument or from GRU inductive biases/optimization landscape. This distinction is load-bearing for the claim that the gap is a general 'structural limit of learning.'
Authors: The impossibility theorem is stated only for linear-Gaussian dynamics and supplies the structural argument. The Duffing-GRU sweep is presented as empirical evidence that qualitatively similar behavior appears outside the linear setting. We acknowledge that GRU inductive biases and the optimization landscape may contribute to the observed statistics. In the revision we will add an explicit paragraph distinguishing the proven linear case from the nonlinear observations and will state that a general nonlinear theorem remains open. We will also report the fraction of tasks in which environment dominance occurs even after controlling for initialization variance. revision: partial
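For readers unfamiliar with the task family, here is a hedged sketch of one way such a sweep could be set up; the paper's exact coupling, parameters, and observation model are not specified here, and everything below is an assumption for illustration.

```python
# Hypothetical Duffing-style task: a driven Duffing oscillator (the
# "system") plus a slow Ornstein-Uhlenbeck drift (the "environment").
# A predictor trained on the mixed observation x + e must implicitly
# allocate sensitivity between the two.
import numpy as np

def simulate(T=5000, dt=0.01, delta=0.2, alpha=-1.0, beta=1.0,
             gamma=0.3, omega=1.2, tau_env=50.0, sigma_env=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x, v, e = 0.1, 0.0, 0.0
    traj = np.empty((T, 3))
    for t in range(T):
        # Duffing: x'' + delta*x' + alpha*x + beta*x^3 = gamma*cos(omega*t)
        acc = (gamma * np.cos(omega * t * dt)
               - delta * v - alpha * x - beta * x ** 3)
        v += acc * dt
        x += v * dt
        # slow Ornstein-Uhlenbeck "environment" drift
        e += -(e / tau_env) * dt + sigma_env * np.sqrt(dt) * rng.normal()
        traj[t] = (x, v, e)
    return traj  # observation for training could be x + e (assumption)
```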
Referee: High-dimensional linear results (N=100 case): the claim of 92% lower prediction error for the optimal encoder versus the causal representation requires a precise definition of the causal baseline encoder and the exact error metric (in-sample vs. out-of-distribution). Without this, the comparison risks conflating predictive performance with the causal-fidelity metric.
Authors: We apologize for the imprecise wording. The causal baseline encoder is the linear map that retains only the system modes (identity on system coordinates, zero on environment coordinates). The reported prediction error is the out-of-distribution mean-squared error evaluated on trajectories generated after an environment-parameter shift; it is not an in-sample quantity. In the revised manuscript we will state these definitions explicitly, provide the exact formula for the OOD MSE, and separate the causal-fidelity table from the prediction-error table to avoid conflation. revision: yes
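The metric described here admits a direct implementation; a minimal sketch with hypothetical argument names (not the paper's code):

```python
# OOD inflation as described in the response: ratio of out-of-distribution
# MSE (after an environment-parameter shift) to in-distribution MSE, with
# the median taken across tasks.
import numpy as np

def mse(y, yhat):
    return float(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))

def median_ood_inflation(tasks):
    """tasks: iterable of (y_iid, yhat_iid, y_shift, yhat_shift) arrays."""
    ratios = [mse(y_s, yh_s) / mse(y_i, yh_i)
              for y_i, yh_i, y_s, yh_s in tasks]
    return float(np.median(ratios))
```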
Circularity Check
No significant circularity detected
full rationale
The paper states an explicit assumption of clean decomposition into system and environment modes with distinct timescales and noise levels in the linear-Gaussian case, then derives that population-risk minimizers allocate sensitivity to the slower/less-noisy modes (labeled environment) under that condition. This follows from analyzing the prediction objective rather than redefining the objective or the labels to force the outcome. Causal fidelity is defined after the decomposition but the allocation result is obtained from the risk functional, not by construction. The open-and-positive-measure claim applies to the parameter set satisfying the slower-environment condition, which is independent of the theorem. Nonlinear experiments are reported separately as empirical observations without extending the same proof. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the observed dynamics admit a decomposition into system modes and environment modes with distinct temporal scales and noise levels.
invented entities (1)
- predictive-causal gap (no independent evidence)