pith. machine review for the scientific record.

arxiv: 2605.05862 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural operators · geometric forgetting · operator learning · irregular geometries · Markovian layers · attention operators · spectral methods · memory injection
0 comments

The pith

Neural operators lose access to domain geometry as depth increases due to Markovian layers and global mixing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the Geometric Forgetting Hypothesis to explain why neural operators struggle on irregular domains even when they handle structured ones effectively. It attributes the problem to the Markovian structure of successive operator layers combined with global mixing operations that erode geometric information over depth. Layer-wise probing confirms that both spectral and attention-based operators lose geometric fidelity systematically. This loss directly impairs accuracy, stability, and generalization on non-uniform geometries. A lightweight memory injection that reintroduces geometric constraints at intermediate layers reverses the effect with minimal added cost.

Core claim

Neural operators progressively lose access to domain geometry as depth increases because operator layers are Markovian and rely on global mixing mechanisms. Layer-wise geometric probing of spectral and attention-based models shows systematic drops in geometric fidelity that degrade accuracy, stability, and generalization; injecting geometry memory at intermediate depths counters the effect by restoring the lost constraints.
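The Markovian part of this claim is an instance of the data-processing inequality: in a chain G → V_1 → ⋯ → V_L, mutual information with the geometry G cannot increase with depth. A toy numeric check of that inequality (a binary "geometry" bit passed through two noisy layers; all numbers here are illustrative, not from the paper):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a 2D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# Binary geometry bit G passed through two symmetric noisy layers.
flip1, flip2 = 0.1, 0.2                                   # per-layer corruption
flip_total = flip1 * (1 - flip2) + (1 - flip1) * flip2    # G -> V2 flip prob

joint_g_v1 = 0.5 * np.array([[1 - flip1, flip1], [flip1, 1 - flip1]])
joint_g_v2 = 0.5 * np.array([[1 - flip_total, flip_total],
                             [flip_total, 1 - flip_total]])

# Information about G decays monotonically along the chain.
assert mutual_information(joint_g_v1) > mutual_information(joint_g_v2) > 0
```

The inequality only says information cannot increase; the paper's contribution is the empirical claim that global mixing layers make the decay severe in practice.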

What carries the argument

Layer-wise geometric probing that tracks fidelity loss across successive operator layers, together with the geometry memory injection that reintroduces domain constraints at chosen depths.
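The probing setup can be sketched with a closed-form linear probe. The mask-reconstruction target follows the description of Figure 3; everything else (feature layout, least-squares probe) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def probe_geometry(hidden_states, mask):
    """Closed-form linear probe per layer: least-squares regression from
    per-point channel features to the domain-mask value at that point.
    Returns per-layer reconstruction MSE; rising MSE with depth is the
    forgetting signature. hidden_states: list of (C, H, W) feature maps."""
    y = mask.reshape(-1)                              # (N,) target mask
    losses = []
    for v in hidden_states:
        X = v.reshape(v.shape[0], -1).T               # (N, C) design matrix
        X = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        losses.append(float(np.mean((X @ w - y) ** 2)))
    return losses
```

A layer whose hidden state still linearly encodes the mask yields near-zero probe error; a layer that has mixed the geometry away yields error near the mask's variance.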

If this is right

  • Geometric forgetting reduces accuracy, stability, and generalization on irregular domains.
  • Lightweight geometry memory injection at intermediate depths restores geometric constraints with low overhead.
  • Transformer-based operators display geometric shortcut instability that the injection exposes.
  • Geometric retention is a structural requirement for operator learning rather than an optional design choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same depth-wise erosion of local structure could appear in other global-mixing architectures such as graph transformers or long-sequence models.
  • The memory injection idea could be tested as a plug-in module for existing operator libraries on new physics simulation tasks.
  • If the hypothesis holds, architecture search should prioritize mechanisms that preserve geometry at every scale instead of relying solely on deeper mixing.
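As a concrete starting point for such a plug-in, the injection described in the paper's figures (v_out = v_l ⊙ γ + β, with γ and β produced by a memory encoder from the geometry) can be sketched as a FiLM-style channel-wise modulation. The linear encoder below is an illustrative stand-in, not the paper's architecture:

```python
import numpy as np

def memory_inject(v, geom_embedding, W_gamma, b_gamma, W_beta, b_beta):
    """Geometry memory injection at one depth: v_out = v * gamma + beta.
    v: (C, H, W) hidden state; geom_embedding: (D,) encoding of the domain
    geometry (e.g., pooled SDF/mask features). The affine maps here stand
    in for the paper's memory encoder."""
    gamma = geom_embedding @ W_gamma + b_gamma        # (C,) channel scale
    beta = geom_embedding @ W_beta + b_beta           # (C,) channel shift
    return v * gamma[:, None, None] + beta[:, None, None]
```

Injecting at every layer corresponds to the "Memory All" configurations in the visualizations; injecting after a single late layer is the configuration that exposes the geometric shortcut in transformer-based operators.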

Load-bearing premise

The observed loss of geometric fidelity in layer-wise probing is caused by the inherent Markovian structure and global mixing mechanisms rather than by training procedure, optimizer choice, or the specific datasets and architectures examined.

What would settle it

An experiment that keeps training procedure, optimizer, datasets, and architecture fixed while measuring geometric fidelity across layers and finds no systematic loss with increasing depth would falsify the forgetting hypothesis.

Figures

Figures reproduced from arXiv: 2605.05862 by Angelica I. Aviles-Rivero, Yanming Xia.

Figure 1
Figure 1. Geometric forgetting as an architectural information-flow phenomenon. Left: in standard neural operators, geometry G is provided only at the input. The hidden states {V_l}_{l=1}^{L} form a Markov chain G → V_1 → ⋯ → V_L, implying that geometric information cannot increase with depth. In global mixing layers (FFT, self-attention), this information decays severely, leading to geometric forgetting. Right: Geomet… view at source ↗
Figure 2
Figure 2. Examples of the diverse and complex geometries in FlowBench, using 9 samples from each of the three groups: the first row shows geometries from the nurbs group G1, the second row the spherical harmonics group G2, and the third row the skelneton group G3. view at source ↗
Figure 2
Figure 2. Impact of memory injection (LDC-NSHT). Without memory, FNO loses flow dynamics and Transolver ignores the obstacle; injecting memory at all layers corrects these failure modes. view at source ↗
Figure 3
Figure 3. Validation of the Forgetting Hypothesis (LDC-NSHT). (a) Reconstruction MSE of the domain mask from hidden states; numbers for this figure are available in Appendix B.1. (b) Representative reconstruction from the Layer 2 hidden state of a standard Transolver; note the blurring in the middle. This empirical analysis reveals that the Forgetting Hypothesis manifests through fundamentally different mechanisms in … view at source ↗
Figure 4
Figure 4. Spectral frequency analysis on the LDC-NSHT dataset; memory is injected at all layers for the 'With Memory' model. (a) Memory injection helps FNO retain boundary information. (b) Memory injection does not alter Transolver's spectral domain signal pattern. view at source ↗
Figure 7
Figure 7. Illustration of a phenomenon termed the Geometric Shortcut: with Layer 3 injection (purple line), the gradient ratio for the final layer saturates to 1.0 while the ratios for all preceding backbone layers collapse to 0.0. Because the memory injection occurs after Layer 3 (v_out = v_3 ⊙ γ + β), the optimizer discovers a greedy solution: it relies primarily on the geometric embeddings (γ, β) from the Memory Encoder. C… view at source ↗
Figure 6
Figure 6. FNO layer-wise gradient ratios (LDC-NS): evolution of relative gradient magnitude for different memory injection locations (Mem at L0–3). The network retains significant gradient signal in early layers even when memory is injected at the final layer, indicating stable learning dynamics. The Gradient Ratio is the relative magnitude of gradients at layer l compared to the total network gradient (R_l = … view at source ↗
Figure 8
Figure 8. FlowBench dataset shapes, adopted from (Tali et al., 2024). view at source ↗
Figure 9
Figure 9. Visualization of LDC-NS velocity in the horizontal direction (panels: Ground Truth, FNO No Memory, FNO Memory All, Transolver No Memory, Transolver Memory All, Mask, Error). view at source ↗
Figure 10
Figure 10. Visualization of LDC-NS velocity in the vertical direction. view at source ↗
Figure 11
Figure 11. Visualization of LDC-NS pressure (panels: Ground Truth, FNO No Memory, FNO Memory All, Transolver No Memory, Transolver Memory All, Mask, Error). view at source ↗
Figure 12
Figure 12. Visualization of LDC-NSHT velocity in the horizontal direction. view at source ↗
Figure 13
Figure 13. Visualization of LDC-NSHT pressure (panels: Ground Truth, FNO No Memory, FNO Memory 3, Transolver No Memory, Transolver Memory 1, Mask, Error). view at source ↗
Figure 14
Figure 14. Visualization of the Darcy result. view at source ↗
Figure 15
Figure 15. Visualization of AirfRANS velocity in the horizontal direction (panels: Ground Truth, FNO No Memory, FNO Memory 0, Transolver No Memory, Transolver Memory Early, Mask, Error). view at source ↗
Figure 16
Figure 16. Visualization of AirfRANS velocity in the vertical direction. view at source ↗
Figure 17
Figure 17. Visualization of AirfRANS pressure. view at source ↗
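The gradient-ratio diagnostic referenced in Figures 6 and 7 can be sketched as follows. The exact normalization is truncated in the caption, so this assumes R_l is layer l's share of total gradient magnitude:

```python
import numpy as np

def gradient_ratios(layer_grads):
    """R_l = ||g_l|| / sum_k ||g_k||: each layer's share of the total
    gradient magnitude across the network. Backbone ratios collapsing
    toward 0 while the post-injection layer saturates toward 1 is the
    signature of the geometric shortcut described for Figure 7."""
    norms = np.array([np.linalg.norm(np.ravel(g)) for g in layer_grads])
    return norms / norms.sum()
```

Tracking these ratios over training is what distinguishes stable learning dynamics (Figure 6, gradient signal retained in early layers) from the greedy shortcut solution (Figure 7).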
read the original abstract

Neural operators perform well on structured domains, yet their behaviour on irregular geometries remains poorly understood. We show that this limitation is not merely an encoding issue, but a depth-wise failure mode inherent to deep operator architectures. We formalise the Geometric Forgetting Hypothesis: due to the Markovian structure of operator layers and their reliance on global mixing mechanisms, neural operators progressively lose access to domain geometry as depth increases. Using layer-wise geometric probing, we demonstrate that both spectral and attention-based operators systematically lose geometric fidelity. We show that this geometric forgetting degrades accuracy, stability, and generalisation. To counteract it, we introduce a lightweight geometry memory injection mechanism that restores geometric constraints at intermediate depths with minimal architectural overhead. This simple intervention consistently mitigates forgetting and exposes a geometric shortcut instability in transformer-based operators, revealing that geometric retention is a structural requirement rather than a design choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript formalizes the Geometric Forgetting Hypothesis, asserting that neural operators progressively lose access to domain geometry with depth due to Markovian layer structure and global mixing mechanisms (Fourier/attention). Layer-wise geometric probing on spectral and attention-based operators is used to demonstrate systematic loss of geometric fidelity, which is claimed to degrade accuracy, stability, and generalization; a lightweight geometry memory injection mechanism is introduced to restore constraints at intermediate depths.

Significance. If the causal attribution to architecture holds and the injection mechanism proves robust, the work would usefully identify a structural limitation in deep operator networks for irregular domains and supply a low-overhead mitigation, potentially informing architecture choices in scientific machine learning. The layer-wise probing approach, if made quantitative and controlled, could become a standard diagnostic tool.

major comments (2)
  1. [Abstract] The central claim that forgetting is an inevitable consequence of Markovian structure plus global mixing is not supported by any reported controls that hold architecture fixed while varying training procedure, initialization, or loss (e.g., geometry-aware auxiliary objectives, or synthetic tasks where geometry preservation is required for low loss). All described experiments use only end-to-end trained models, leaving open the possibility that the observed depth-wise decay is an optimization artifact rather than a structural necessity.
  2. [Abstract] No quantitative results, error bars, dataset specifications, probe metrics, or ablation studies are supplied to substantiate the 'systematic loss of geometric fidelity' or its performance impact, rendering the hypothesis unverifiable from the provided evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify gaps in experimental controls and the presentation of quantitative evidence for the Geometric Forgetting Hypothesis. We address each point below, agree where the manuscript requires strengthening, and outline specific revisions.

read point-by-point responses
  1. Referee: [Abstract] The central claim that forgetting is an inevitable consequence of Markovian structure plus global mixing is not supported by any reported controls that hold architecture fixed while varying training procedure, initialization, or loss (e.g., geometry-aware auxiliary objectives, or synthetic tasks where geometry preservation is required for low loss). All described experiments use only end-to-end trained models, leaving open the possibility that the observed depth-wise decay is an optimization artifact rather than a structural necessity.

    Authors: We agree that the current experiments rely exclusively on end-to-end training and do not include the suggested controls that hold architecture fixed while varying training procedure, initialization, or loss functions. This leaves open the possibility that the depth-wise decay is partly an optimization artifact. While the consistent pattern across spectral and attention-based operators and the layer-wise probing provide architectural motivation for the hypothesis, these do not constitute a rigorous isolation of structure from training dynamics. In the revised manuscript we will add controlled experiments: (i) geometry-aware auxiliary objectives, (ii) synthetic tasks where geometry preservation is necessary for low loss, and (iii) multiple initializations and training schedules with fixed architectures. These additions will directly test whether the forgetting persists under conditions that incentivize geometric retention. revision: yes

  2. Referee: [Abstract] No quantitative results, error bars, dataset specifications, probe metrics, or ablation studies are supplied to substantiate the 'systematic loss of geometric fidelity' or its performance impact, rendering the hypothesis unverifiable from the provided evidence.

    Authors: We acknowledge that the abstract and the high-level experiment description in the submission do not contain the requested quantitative details, error bars, dataset specifications, explicit probe metrics, or ablation tables. The full manuscript body does report layer-wise probe results, benchmark errors, and injection ablations, but these were not sufficiently foregrounded or summarized. We will revise the abstract to include key quantitative statements (e.g., average fidelity decay rates and performance deltas) with references to specific figures and tables. In addition, we will add a dedicated section that clearly defines all probe metrics, lists dataset specifications, reports error bars from repeated runs, and expands the ablation studies on the memory injection mechanism. These changes will make the supporting evidence verifiable without altering the core claims. revision: partial

Circularity Check

0 steps flagged

No circularity: hypothesis is empirical observation from probing, not a closed derivation

full rationale

The paper formalizes the Geometric Forgetting Hypothesis as an interpretation of layer-wise geometric probing results on trained spectral and attention-based operators. No equations or derivations are presented that reduce by construction to fitted inputs, self-definitions, or prior self-citations. The central claims rest on experimental demonstrations of depth-wise fidelity loss rather than any load-bearing theoretical chain that loops back to the paper's own assumptions or data fits. This is a standard empirical framing with independent content from the probes and interventions described.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that operator layers are Markovian and that global mixing erases geometry; the injection mechanism is a new proposed entity without independent evidence outside the paper's experiments.

axioms (1)
  • domain assumption: operator layers possess Markovian structure that discards prior geometric information.
    Invoked directly in the formalization of the Geometric Forgetting Hypothesis.
invented entities (1)
  • geometry memory injection mechanism (no independent evidence)
    purpose: restore geometric constraints at intermediate depths with minimal overhead.
    New architectural component introduced to counteract the hypothesized forgetting.

pith-pipeline@v0.9.0 · 5448 in / 1312 out tokens · 40548 ms · 2026-05-08T14:38:13.341911+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Neural operator: Graph kernel network for partial differential equations

    Anandkumar, A., Azizzadenesheli, K., Bhattacharya, K., Kovachki, N., Li, Z., Liu, B., and Stuart, A. Neural operator: Graph kernel network for partial differential equations. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations.

  2. [2]

    Learning long-term dependencies with gradient descent is difficult

    Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181. Bonnet, F., Mazari, J., Cinnella, P., and Gallinari, P. AirfRANS: High fidelity computational fluid dynamics dataset for approximating Reynolds-averaged Navier–Stokes solutions. Advances in Neural Information Processing Systems, 35:23463–23478, 2022.

  3. [3]

    On the benefits of memory for modeling time-dependent PDEs

    Buitrago, R., Marwah, T., Gu, A., and Risteski, A. On the benefits of memory for modeling time-dependent PDEs. International Conference on Learning Representations (ICLR), 2025.

  4. [4]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

  5. [5]

    A library for learning neural operators

    Kossaifi, J., Kovachki, N., Li, Z., Pitt, D., Liu-Schiaffini, M., Duruisseaux, V., George, R. J., Bonev, B., Azizzadenesheli, K., Berner, J., and Anandkumar, A. A library for learning neural operators. arXiv preprint arXiv:2412.10354.

  6. [6]

    Geometric operator learning with optimal transport

    Li, X., Li, Z., Kovachki, N., and Anandkumar, A. Geometric operator learning with optimal transport. arXiv preprint arXiv:2507.20065, 2025.

  7. [7]

    Graph-based operator learning from limited data on irregular domains

    Li, Y. and Zhe, S. Graph-based operator learning from limited data on irregular domains. arXiv preprint arXiv:2505.18923.

  8. [8]

    Fourier Neural Operator for Parametric Partial Differential Equations

    Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.

  9. [9]

    Transformers learn shortcuts to automata

    Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749.

  10. [10]

    Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators

    Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021. ISSN 2522-5839. doi: 10.1038/s42256-021-00302-5. Luo, H., Wu, H., Zhou, H., Xing, L., Di, Y., Wang, J., and Long, M. Transolver++: An accurate neural solver for PDEs on million-scale geometries. arXiv preprint arXiv:2502.02414.

  11. [11]

    Physics- and geometry-aware spatio-spectral graph neural operator for time-independent and time-dependent PDEs

    Sarkar, S. and Chakraborty, S. Physics- and geometry-aware spatio-spectral graph neural operator for time-independent and time-dependent PDEs. arXiv preprint arXiv:2508.09627, 2025a. Sarkar, S. and Chakraborty, S. Spatio-spectral graph neural operator for solving computational mechanics problems on irregular...

  12. [12]

    Diffeomorphic neural operator learning

    URL https://arxiv.org/abs/2409.18032. Taylor, S., Bihlo, A., and Nave, J.-C. Diffeomorphic neural operator learning. arXiv preprint arXiv:2508.06690.

  13. [13]

    CViT: Continuous vision transformer for operator learning

    Wang, S., Seidman, J. H., Sankaran, S., Wang, H., Pappas, G. J., and Perdikaris, P. CViT: Continuous vision transformer for operator learning. arXiv preprint arXiv:2405.13998.

  14. [14]

    Incremental spectral learning in Fourier neural operator

    Zhao, J., George, R. J., Li, Z., and Anandkumar, A. Incremental spectral learning in Fourier neural operator. arXiv preprint arXiv:2211.15188.

  15. [15]

    LDC-NS and LDC-NSHT inputs and outputs

    Inputs and Outputs. LDC-NS: the input tuple consists of (Re, SDF, Mask, x), where Re is the Reynolds number and x are coordinate channels; the output is the velocity field (u_x, u_y) and pressure p. LDC-NSHT: the input tuple is expanded to include the Richardson number (Ri), representing the ratio of buoyancy to inertial forces: (Re, Ri, SDF, Mask, x). The outp...