pith. machine review for the scientific record.

arxiv: 2605.14546 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural operators · PDE surrogate modeling · weight space directions · model merging · physical parameter transfer · fine-tuning · extrapolation · Navier-Stokes

The pith

Fine-tuning a shared neural PDE operator into low- and high-regime endpoint experts reveals a reusable physical direction in weight space for training-free regime composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that weight updates from fine-tuning a shared neural operator to low- and high-regime endpoints decompose into a family-shared adaptation component plus a direction aligned with the underlying physical parameter. This decomposition reframes the endpoints as finite-difference probes along that direction, which explains why naive averaging produces usable intermediates yet weakens regime-specific physics. The authors introduce Calibration-Conditioned Merge to read out a target coordinate from physical metadata or a short rollout prefix and produce a single merged checkpoint. If the separation holds, practitioners gain a post-hoc method to transfer across PDE regimes without retraining, with the largest reported gains appearing in extrapolation settings. The approach is validated on reaction-diffusion, viscosity-parameterized Navier-Stokes, and radial dam-break systems across multiple operator scales.

Core claim

Starting from a shared family anchor, fine-tuning to low- and high-regime endpoints separates the resulting weight updates into a family-shared adaptation and a direction aligned with the physical parameter. Endpoint experts therefore function as finite-difference probes of a local physical direction in weight space. This perspective motivates Calibration-Conditioned Merge, which infers a composition coordinate from physical metadata, a calibrated mapping, or a short observed rollout prefix and deploys a single merged checkpoint for the remaining rollout. On the evaluated benchmarks the method reduces out-of-distribution rollout error relative to the family anchor by 54.2, 42.8, and 13.8 percent on the reaction-diffusion, Navier-Stokes, and radial dam-break families, respectively.
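
Made concrete, with one hedge: the ±1 endpoint convention below is an editorial assumption, not notation the paper fixes in the excerpts shown here. Reading the decomposition from Figure 2:

```latex
% Endpoint residuals combined into a shared part and a signed direction
% (assuming the low/high endpoints sit at alpha = -1 and +1):
\Delta_{+} = \tfrac{1}{2}\left[(\theta_{\mathrm{high}}-\theta_0) + (\theta_{\mathrm{low}}-\theta_0)\right], \quad
\Delta_{-} = \tfrac{1}{2}\left[(\theta_{\mathrm{high}}-\theta_0) - (\theta_{\mathrm{low}}-\theta_0)\right], \quad
\hat{\theta} = \theta_0 + \Delta_{+} + \hat{\alpha}\,\Delta_{-}.
```

Under this reading the endpoints are central finite-difference probes at α = ∓1, and the plain endpoint average equals θ0 + ∆+: it keeps the shared adaptation but zeroes out ∆−, which is exactly why static averaging produces usable intermediates while attenuating regime-specific physics.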

What carries the argument

Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout that composes neural PDE experts along the discovered physical direction in weight space using metadata or a rollout prefix.
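
A minimal sketch of that composition rule in code. The function name merge_ccm, the plain NumPy state dicts, and the α̂ = −1/+1 endpoint convention are assumptions of this sketch, not the authors' implementation:

```python
# Minimal sketch of the CCM composition rule from Figure 2:
#   theta_hat = theta_0 + delta_plus + alpha_hat * delta_minus
# Checkpoints are modeled as dicts of NumPy arrays; the +/-1 endpoint
# convention is an assumption of this sketch, not stated by the paper.
import numpy as np

def merge_ccm(theta_0, theta_low, theta_high, alpha_hat):
    """Compose one merged checkpoint along the endpoint-defined physical direction."""
    merged = {}
    for name in theta_0:
        res_low = theta_low[name] - theta_0[name]    # endpoint-anchor residual, low regime
        res_high = theta_high[name] - theta_0[name]  # endpoint-anchor residual, high regime
        delta_plus = 0.5 * (res_high + res_low)      # family-shared adaptation
        delta_minus = 0.5 * (res_high - res_low)     # signed physical direction
        merged[name] = theta_0[name] + delta_plus + alpha_hat * delta_minus
    return merged
```

With this convention α̂ = 0 reproduces the static average of the two experts, so the direction-aware merge strictly generalizes averaging.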

If this is right

  • Static averaging of endpoint experts attenuates regime-specific physics and yields higher error than direction-aware merging.
  • A single merged checkpoint suffices for the full rollout once the composition coordinate is inferred from metadata or a short prefix (a prefix-selection sketch follows this list).
  • Error reductions are largest in extrapolative regimes lying outside the fine-tuned endpoints.
  • The physical direction remains consistent when the underlying operator is scaled or replaced by a DPOT-style backbone.
  • Endpoint fine-tuning produces reusable structure rather than isolated regime experts.
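
One plausible realization of the prefix readout (CCM-Prefix) is a small grid search over candidate coordinates, scoring each merged checkpoint against the observed prefix. This sketch reuses merge_ccm from above; rollout_fn, the grid range, and the L2 scoring are hypothetical stand-ins rather than the paper's specified readout:

```python
# Hedged sketch of prefix-based coordinate selection (CCM-Prefix).
# rollout_fn is a hypothetical runner: rollout_fn(theta, u0, n_steps) -> states.
import numpy as np

def select_alpha_from_prefix(theta_0, theta_low, theta_high,
                             rollout_fn, prefix, n_grid=25):
    """Return the alpha whose merged checkpoint best reproduces an observed prefix."""
    u0, target = prefix[0], np.asarray(prefix[1:])
    best_alpha, best_err = 0.0, float("inf")
    # Allow mild extrapolation past the +/-1 endpoints, since the paper reports
    # its largest gains in extrapolative regimes.
    for alpha in np.linspace(-1.5, 1.5, n_grid):
        theta = merge_ccm(theta_0, theta_low, theta_high, alpha)
        pred = rollout_fn(theta, u0, n_steps=len(target))
        err = float(np.mean((np.asarray(pred) - target) ** 2))  # prefix L2 error
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha  # deploy merge_ccm(..., best_alpha) for the remaining rollout
```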

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the direction proves approximately linear, the same separation could support zero-shot adaptation to continuous physical parameters never seen during fine-tuning.
  • Analogous directions may exist in other continuous-attribute domains such as vision models conditioned on scale or physics-informed language models.
  • Extending the approach to three-dimensional or coupled multiphysics systems would test whether the separation survives increased complexity.

Load-bearing premise

The observed separation of weight updates into a family-shared part and a physical-parameter-aligned direction is stable across fine-tuning procedures and generalizes beyond the tested regimes.

What would settle it

If the vector difference between high- and low-regime fine-tuned weights, after removal of the shared adaptation component, fails to produce accurate merged predictions for an unseen intermediate physical parameter when used as the CCM direction, while independent retraining succeeds.

Figures

Figures reproduced from arXiv: 2605.14546 by Dong Ni, Guanyu Chen, Pengkai Wang, Pengwei Liu, Qixin Zhang, Xiaolong Li, Xingyu Ren, Yuanyi Wang, Yuting Kong, Zhongkai Hao.

Figure 1. Endpoint residuals encode physical coordinates. Decomposing endpoint–anchor updates isolates a shared solver adaptation and a signed physical direction. This learned direction can align with physical coordinates and support extrapolative rollouts. Reversed-ordering controls test whether these gains arise from physical orientation rather than arbitrary interpolation. Adaptation across physical regimes remai…
Figure 2. Calibration-Conditioned Merge framework. A shared family anchor θ0 is trained on support regimes and fine-tuned into low/high endpoint experts. Their endpoint–anchor residuals define a shared adaptation ∆+ and a signed physical direction ∆−. CCM selects a target coordinate α̂ from metadata, scale calibration, or a short rollout prefix, and instantiates one checkpoint θ̂ = θ0 + ∆+ + α̂∆− for rollout.
Figure 3. Shared and signed endpoint structure on controlled DiffReact axes. Left: endpoint averaging is effective on merge-friendly axes, indicating a reusable shared component ∆+. Right: the f3 curve is minimized at α = 0, whereas the harder f2 medium-gap setting prefers a nonzero signed family direction ∆−.
Figure 4. Coordinate law and endpoint-line smoothness across PDE families. Panels A, C, and E compare the normalized physical coordinate against the corresponding diagnostic oracle-derived coordinate. Panels B, D, and F display the normalized excess loss evaluated along the same endpoint-defined coordinate axis. DiffReact and NS2D show strong coordinate alignment, whereas RDB exhibits weaker alignme…
Figure 5. Cross-domain evidence atlas. A–B show coordinate alignment; C, RDB prefix selection; D, wrong-sign coordinate control; E, matched-seed independent reruns; F–H, rollout error comparisons; I, auxiliary diagnostic reductions; J, the FNO scale trend. Matched-seed reruns retain the qualitative advantages of both main strategies: metadata-based coordinate selection on DiffReact and prefix-based selection on RDB.…
Figure 6. Large-FNO DiffReact f3 α sweep. (From Appendix E.4, DPOT-style backbone validation: the DPOT-style line evaluates the same composition rule outside the FNO architecture. The base reaches 0.0447, average merge improves to 0.0391, and CCM-Coord reaches 0.0183, matching the oracle on this small coordinate bank.)
Figure 7. NS2D prefix-calibration budget. (From Appendix E.7, RDB fixed alpha versus conditional composition: RDB is the least directly parameterized domain because the free-surface transient is not well described by a monotone scalar metadata coordinate. The main text therefore uses CCM-Prefix for this setting. The selected α varies by test sample and task, using the short prefix to choose the correct side of the endpoint direction…)
Figure 8. RDB fixed-α frontier versus conditional CCM-Prefix. (Embedded plot, RDB calibration budget: prefix steps 1–8 on the x-axis vs. future-only L2 on the y-axis, 0.1–0.5; series: Overall, Endpoint, Worst.)
Figure 9. RDB calibration-budget curve.
Figure 10. RDB task-level selected α. (From Appendix E.8, unified method ablation: the unified ablation compares static averaging, best fixed α, wrong fixed α, and conditional CCM on NS2D and RDB. It measures whether conditioning changes the selected point on the merge line beyond what one fixed interpolation coefficient can provide. Conditioning matters most when one fixed α cannot serve all regimes.)
Figure 11. Late-time DiffReact f2 medium-gap visualization. Rows correspond to four medium-gap Du tasks for the same seed. Columns show late rollout frames from t = 3.50 to t = 5.00; the u channel is shown for compactness.
Figure 12. Late-time NS2D medium-gap visualization. Rows correspond to medium-gap low/high viscosity tasks for the same seed (panel labels: MG-low ν = 4e-5, MG-high ν = 2.2e-4; NS2D viscosity family, seed 100128; vorticity snapshots). Columns show late rollout frames from t = 3.55 to t = 5.00.
Figure 13. Late-time RDB medium-gap visualization. Rows correspond to four medium-gap height tasks for the same seed. Columns show late rollout frames from t = 0.71 to t = 1.00.
Original abstract

Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction-diffusion system, viscosity-parameterized two-dimensional Navier-Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that fine-tuning a shared neural operator to low- and high-regime endpoints within a continuous physical family separates weight updates into a family-shared adaptation component and a direction aligned with the underlying physical parameter. This reinterpretation motivates the Calibration-Conditioned Merge (CCM) method, which uses physical metadata, a calibrated coordinate mapping, or a short rollout prefix to infer a composition coordinate and deploy a merged checkpoint for training-free transfer. Evaluations on reaction-diffusion, viscosity-parameterized 2D Navier-Stokes, and radial dam-break dynamics report out-of-distribution rollout error reductions of 54.2%, 42.8%, and 13.8% relative to the family anchor, with further ablations across FNO scales and a DPOT-style backbone.

Significance. If the claimed physical direction proves robust rather than procedure-dependent, the work could offer a principled mechanism for composing neural PDE experts along continuous physical parameters, enabling efficient extrapolation without retraining. The empirical gains in extrapolative regimes across three distinct benchmarks indicate potential practical value for scalable surrogate modeling, though the absence of controls for optimization details limits current confidence in the separation's stability.

major comments (2)
  1. [Abstract] The separation of updates into family-shared adaptation and physical direction is performed by taking the difference between low- and high-regime endpoint fine-tunings relative to the shared anchor (as described in the abstract). This vector is then treated as aligned with the physical parameter for CCM composition. For the claim to hold, the direction must be dominated by the parameter change and insensitive to optimization details, yet the abstract reports ablations only across backbones and scales with no explicit controls varying learning rate, step count, optimizer, or anchor perturbation.
  2. [Experiments] The reported error reductions (54.2%, 42.8%, 13.8%) are presented without error bars, details on data splits, or ablation controls for the CCM method versus the family anchor. This makes it difficult to assess whether the gains are statistically reliable or sensitive to the specific fine-tuning trajectories used to discover the direction.
minor comments (1)
  1. [Abstract] The abstract refers to 'a calibrated coordinate mapping' and a 'short observed rollout prefix' for inferring the composition coordinate, but the precise functional form of the readout, and how it is fitted from metadata or prefix data, is not specified as an equation in the provided summary.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the robustness of the claimed physical direction and the statistical presentation of results. We address each major comment below and commit to revisions that strengthen the evidence without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] The separation of updates into family-shared adaptation and physical direction is performed by taking the difference between low- and high-regime endpoint fine-tunings relative to the shared anchor (as described in the abstract). This vector is then treated as aligned with the physical parameter for CCM composition. For the claim to hold, the direction must be dominated by the parameter change and insensitive to optimization details, yet the abstract reports ablations only across backbones and scales with no explicit controls varying learning rate, step count, optimizer, or anchor perturbation.

    Authors: We agree that explicit controls for optimization details would strengthen the interpretation that the discovered direction is dominated by the physical parameter rather than the fine-tuning procedure. The existing ablations across FNO scales and a DPOT-style backbone already show consistency, but they do not vary learning rate, step count, optimizer, or anchor initialization. In the revised manuscript we will add a dedicated ablation subsection that systematically varies these factors on at least one benchmark and reports the resulting direction stability (measured by cosine similarity to the original direction and downstream CCM error; a sketch of this check follows the responses). revision: yes

  2. Referee: [Experiments] The reported error reductions (54.2%, 42.8%, 13.8%) are presented without error bars, details on data splits, or ablation controls for the CCM method versus the family anchor. This makes it difficult to assess whether the gains are statistically reliable or sensitive to the specific fine-tuning trajectories used to discover the direction.

    Authors: We acknowledge that the current manuscript lacks error bars, explicit data-split descriptions, and direct statistical comparisons of CCM against the family anchor. In the revision we will (i) recompute all reported rollout errors over at least five independent random seeds and include standard-error bars, (ii) add a table specifying train/validation/test splits and trajectory counts for each benchmark, and (iii) include an ablation table that directly contrasts CCM against the anchor with paired statistical tests (e.g., Wilcoxon signed-rank; sketched after these responses) to quantify the significance of the observed gains. revision: yes
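
For the direction-stability check promised in response 1, a minimal sketch; flatten_direction, direction_cosine, and the per-run ∆− dicts are illustrative names, not code from the paper:

```python
# Sketch of the direction-stability check: flatten each rerun's signed physical
# direction (a dict of parameter arrays, as in the merge sketches above) and
# compare runs by cosine similarity.
import numpy as np

def flatten_direction(delta):
    """Concatenate all parameter tensors of a weight-space direction into one vector."""
    return np.concatenate([np.ravel(delta[name]) for name in sorted(delta)])

def direction_cosine(delta_a, delta_b):
    """Cosine similarity between two weight-space directions (1.0 = same orientation)."""
    va, vb = flatten_direction(delta_a), flatten_direction(delta_b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```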
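
For the paired test in response 2, a sketch using SciPy's Wilcoxon signed-rank implementation; the error arrays are synthetic placeholders, not results from the paper:

```python
# Sketch of the committed paired test: per-trajectory rollout errors for the
# family anchor vs. the CCM checkpoint, compared with a Wilcoxon signed-rank
# test. The arrays below are synthetic placeholders for illustration only.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
anchor_err = rng.gamma(2.0, 0.05, size=40)             # placeholder anchor errors
ccm_err = anchor_err * rng.uniform(0.4, 0.9, size=40)  # placeholder CCM errors
# H1: anchor errors are stochastically greater than matched CCM errors.
stat, p_value = wilcoxon(anchor_err, ccm_err, alternative="greater")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.3g}")
```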

Circularity Check

0 steps flagged

No significant circularity: empirical discovery of weight-space directions remains self-contained

Full rationale

The paper presents an empirical procedure: fine-tune endpoint experts from a shared anchor, observe that their difference vector separates family-shared adaptation from a direction that empirically aligns with the physical parameter, then deploy CCM as a post-hoc linear composition along that observed vector. No equation or derivation reduces the claimed physical direction to a fitted quantity defined from the same evaluation data by construction; the alignment is tested via rollout error on held-out and extrapolative regimes rather than being tautological. Self-citations are not load-bearing for the central claim, and no ansatz or uniqueness theorem is smuggled in to force the result. The method is therefore a coordinate readout on an independently observed vector, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text. The physical direction is presented as discovered rather than postulated.

pith-pipeline@v0.9.0 · 5623 in / 1197 out tokens · 37317 ms · 2026-05-15T01:59:33.617795+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Fourier Neural Operator for Parametric Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020

  2. [2]

    Neural operators for accelerating scientific simulations and design

    Kamyar Azizzadenesheli, Nikola Kovachki, Zongyi Li, Miguel Liu-Schiaffini, Jean Kossaifi, and Anima Anandkumar. Neural operators for accelerating scientific simulations and design. Nature Reviews Physics, 6(5):320–328, 2024

  3. [3]

    Gnot: A general neural operator transformer for operator learning

    Zhongkai Hao, Zhengyi Wang, Hang Su, Chengyang Ying, Yinpeng Dong, Songming Liu, Ze Cheng, Jian Song, and Jun Zhu. Gnot: A general neural operator transformer for operator learning. In International Conference on Machine Learning, pages 12556–12569. PMLR, 2023

  4. [4]

    Laplace neural operator for solving differential equations

    Qianying Cao, Somdatta Goswami, and George Em Karniadakis. Laplace neural operator for solving differential equations. Nature Machine Intelligence, 6(6):631–640, 2024

  5. [5]

    Neural Operator: Graph Kernel Network for Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020

  6. [6]

    Alias-free mamba neural operator

    Jianwei Zheng, Wei Li, Ni Xu, Junwei Zhu, Xiaoxu Lin, and Xiaoqin Zhang. Alias-free mamba neural operator. Advances in Neural Information Processing Systems, 37:52962–52995, 2024

  7. [7]

    Aerogto: An efficient graph-transformer operator for learning large-scale aerodynamics of 3d vehicle geometries

    Pengwei Liu, Pengkai Wang, Xingyu Ren, Hangjie Yuan, Zhongkai Hao, Chao Xu, Shengze Cai, and Dong Ni. Aerogto: An efficient graph-transformer operator for learning large-scale aerodynamics of 3d vehicle geometries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18924–18932, 2025

  8. [8]

    An efficient graph-transformer operator for learning physical dynamics with manifolds embedding

    Pengwei Liu, Xingyu Ren, Pengkai Wang, Hangjie Yuan, Zhongkai Hao, Guanyu Chen, Chao Xu, Dong Ni, and Shengze Cai. An efficient graph-transformer operator for learning physical dynamics with manifolds embedding. arXiv preprint arXiv:2512.10227, 2025

  9. [9]

    Foundation neural operators: A survey on pretraining methods, the data ecosystem, and efficient adaptation

    Xingyu Ren, Pengkai Wang, Pengwei Liu, Xihang Yue, Huanshuo Dong, Zhenxin Huang, Zhongkai Hao, Ziqian Hu, Zhen Huang, Yian Wang, et al. Foundation neural operators: A survey on pretraining methods, the data ecosystem, and efficient adaptation. 2026

  10. [10]

    Uncertainty-informed meta pseudo labeling for surrogate modeling with limited labeled data

    Xingyu Ren, Pengwei Liu, Pengkai Wang, Guanyu Chen, Qinxin Wu, and Dong Ni. Uncertainty-informed meta pseudo labeling for surrogate modeling with limited labeled data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  11. [11]

    Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators

    Thorsten Kurth, Shashank Subramanian, Peter Harrington, Jaideep Pathak, Morteza Mardani, David Hall, Andrea Miele, Karthik Kashinath, and Anima Anandkumar. Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, pages 1–11, 2023

  12. [12]

    Learning skillful medium-range global weather forecasting

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023

  13. [13]

    Learning nonlinear operators via deeponet based on the universal approximation theorem of operators

    Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3):218–229, 2021

  14. [14]

    Neural operator: Learning maps between function spaces with applications to pdes

    Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023

  15. [15]

    Dpot: auto-regressive denoising operator transformer for large-scale pde pre-training

    Zhongkai Hao, Chang Su, Songming Liu, Julius Berner, Chengyang Ying, Hang Su, Anima Anandkumar, Jian Song, and Jun Zhu. Dpot: auto-regressive denoising operator transformer for large-scale pde pre-training. In Proceedings of the 41st International Conference on Machine Learning, pages 17616–17635, 2024

  16. [16]

    Poseidon: Efficient foundation models for pdes

    Maximilian Herde, Bogdan Raonić, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel De Bezenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024

  17. [17]

    Pretraining codomain attention neural operators for solving multiphysics pdes

    Ashiqur Rahman, Robert J George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics pdes. Advances in Neural Information Processing Systems, 37:104035–104064, 2024

  18. [18]

    Multiple physics pretraining for spatiotemporal surrogate models

    Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024

  19. [19]

    Walrus: A cross-domain foundation model for continuum dynamics

    Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze WK Wong, Hadi Sotoudeh, Alberto Bietti, et al. Walrus: A cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684, 2025

  20. [20]

    Neural general circulation models for weather and climate

    Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, et al. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, 2024

  21. [21]

    A foundation model for the earth system

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641(8065):1180–1187, 2025

  22. [22]

    Pdebench: An extensive benchmark for scientific machine learning

    Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022

  23. [23]

    The well: a large-scale collection of diverse physics simulations for machine learning

    Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina J Agocs, Miguel Beneitez, Marsha Berger, Blakesley Burkhart, Stuart B Dalziel, Drummond B Fielding, et al. The well: a large-scale collection of diverse physics simulations for machine learning. Advances in Neural Information Processing Systems, 37:44989–45037, 2024

  24. [24]

    Apebench: A benchmark for autoregressive neural emulators of pdes

    Felix Koehler, Simon Niedermayr, Rüdiger Westermann, and Nils Thuerey. Apebench: A benchmark for autoregressive neural emulators of pdes. Advances in Neural Information Processing Systems, 37:120252–120310, 2024

  25. [25]

    Learning mesh-based simulation with graph networks

    Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W Battaglia. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409, 2020

  26. [26]

    Magnet: Mesh agnostic neural pde solver

    Oussama Boussif, Yoshua Bengio, Loubna Benabbou, and Dan Assouline. Magnet: Mesh agnostic neural pde solver. Advances in Neural Information Processing Systems, 35:31972–31985, 2022

  27. [27]

    Deep transfer operator learning for partial differential equations under conditional shift

    Somdatta Goswami, Katiana Kontolati, Michael D Shields, and George Em Karniadakis. Deep transfer operator learning for partial differential equations under conditional shift. Nature Machine Intelligence, 4(12):1155–1164, 2022

  28. [28]

    Fourier neural operator with learned deformations for pdes on general geometries

    Zongyi Li, Daniel Zhengyu Huang, Burigede Liu, and Anima Anandkumar. Fourier neural operator with learned deformations for pdes on general geometries. Journal of Machine Learning Research, 24(388):1–26, 2023

  29. [29]

    Geometry-informed neural operator for large-scale 3d pdes

    Zongyi Li, Nikola Kovachki, Chris Choy, Boyi Li, Jean Kossaifi, Shourya Otta, Mohammad Amin Nabian, Maximilian Stadler, Christian Hundt, Kamyar Azizzadenesheli, et al. Geometry-informed neural operator for large-scale 3d pdes. Advances in Neural Information Processing Systems, 36:35836–35854, 2023

  30. [30]

    Group equivariant fourier neural operators for partial differential equations

    Jacob Helwig, Xuan Zhang, Cong Fu, Jerry Kurtin, Stephan Wojtowytsch, and Shuiwang Ji. Group equivariant fourier neural operators for partial differential equations. arXiv preprint arXiv:2306.05697, 2023

  31. [31]

    Latent neural operator for solving forward and inverse pde problems

    Tian Wang and Chuang Wang. Latent neural operator for solving forward and inverse pde problems. Advances in Neural Information Processing Systems, 37:33085–33107, 2024

  32. [32]

    A scalable framework for learning the geometry-dependent solution operators of partial differential equations

    Minglang Yin, Nicolas Charon, Ryan Brody, Lu Lu, Natalia Trayanova, and Mauro Maggioni. A scalable framework for learning the geometry-dependent solution operators of partial differential equations. Nature Computational Science, 4(12):928–940, 2024

  33. [33]

    Diffeomorphism neural operator for various domains and parameters of partial differential equations

    Zhiwei Zhao, Changqing Liu, Yingguang Li, Zhibin Chen, and Xu Liu. Diffeomorphism neural operator for various domains and parameters of partial differential equations. Communications Physics, 8(1):15, 2025

  34. [34]

    Transolver: A Fast Transformer Solver for PDEs on General Geometries

    Haixu Wu, Huakun Luo, Haowen Wang, Jianmin Wang, and Mingsheng Long. Transolver: A fast transformer solver for pdes on general geometries. arXiv preprint arXiv:2402.02366, 2024

  35. [35]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022

  36. [36]

    Merging models with fisher-weighted averaging

    Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

  37. [37]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022

  38. [38]

    Ties-merging: Resolving interference when merging models

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023

  39. [39]

    Adamerging: Adaptive model merging for multi-task learning

    Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575, 2023

  40. [40]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024

  41. [41]

    Model Merging Scaling Laws in Large Language Models

    Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models. arXiv preprint arXiv:2509.24244, 2025

  42. [42]

    Infifpo: Implicit model fusion via preference optimization in large language models

    Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. Infifpo: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878, 2025

  43. [43]

    Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion

    Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025

  44. [44]

    Multiwavelet-based operator learning for differential equations

    Gaurav Gupta, Xiongye Xiao, and Paul Bogdan. Multiwavelet-based operator learning for differential equations. Advances in Neural Information Processing Systems, 34:24048–24062, 2021

  45. [45]

    Choose a transformer: Fourier or Galerkin

    Shuhao Cao. Choose a transformer: Fourier or Galerkin. Advances in Neural Information Processing Systems, 34:24924–24940, 2021

  46. [46]

    Factorized fourier neural operators

    Alasdair Tran, Alexander Mathews, Lexing Xie, and Cheng Soon Ong. Factorized fourier neural operators. arXiv preprint arXiv:2111.13802, 2021

  47. [47]

    Message passing neural pde solvers

    Johannes Brandstetter, Daniel Worrall, and Max Welling. Message passing neural pde solvers. arXiv preprint arXiv:2202.03376, 2022

  48. [48]

    Mgno: Efficient parameterization of linear operators via multigrid

    Juncai He, Xinliang Liu, and Jinchao Xu. Mgno: Efficient parameterization of linear operators via multigrid. arXiv preprint arXiv:2310.19809, 2023

  49. [49]

    Accurate medium-range global weather forecasting with 3d neural networks

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023

  50. [50]

    End-to-end data-driven weather prediction

    Anna Allen, Stratis Markou, Will Tebbutt, James Requeima, Wessel P Bruinsma, Tom R Andersson, Michael Herzog, Nicholas D Lane, Matthew Chantry, J Scott Hosking, et al. End-to-end data-driven weather prediction. Nature, 641(8065):1172–1179, 2025

  51. [51]

    Mpg: An efficient multi-scale point-based gnn for non-uniform meshes

    Qinxin Wu, Pengwei Liu, Xingyu Ren, and Dong Ni. Mpg: An efficient multi-scale point-based gnn for non-uniform meshes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3–18. Springer, 2025

  52. [52]

    In-context operator learning with data prompts for differential equation problems

    Liu Yang, Siting Liu, Tingwei Meng, and Stanley J Osher. In-context operator learning with data prompts for differential equation problems. Proceedings of the National Academy of Sciences, 120(39):e2310142120, 2023

  53. [53]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  54. [54]

    Model fusion via optimal transport

    Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020

  55. [55]

    Git re-basin: Merging models modulo permutation symmetries

    Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

  56. [56]

    Task arithmetic in the tangent space: Improved editing of pre-trained models

    Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. Advances in Neural Information Processing Systems, 36:66727–66754, 2023

  57. [57]

    Emr-merging: Tuning-free high-performance model merging

    Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769, 2024

  58. [58]

    Calm: Consensus-aware localized merging for multi-task learning

    Kunda Yan, Min Zhang, Sen Cui, Zikun Qu, Bo Jiang, Feng Liu, and Changshui Zhang. Calm: Consensus-aware localized merging for multi-task learning. arXiv preprint arXiv:2506.13406, 2025

  59. [59]

    Map: Low-compute model merging with amortized pareto fronts via quadratic approximation

    Lu Li, Tianyu Zhang, Zhiqi Bu, Suyuchen Wang, Huan He, Jie Fu, Yonghui Wu, Jiang Bian, Yong Chen, and Yoshua Bengio. Map: Low-compute model merging with amortized pareto fronts via quadratic approximation. arXiv preprint arXiv:2406.07529, 2024

  60. [60]

    there is a solution u(a, µ) with R(u(a, µ), a, µ) = 0; 3. D_u R(u(a, µ), a, µ) : U → Z is invertible with uniformly bounded inverse

  61. [61]

    Complete prefix L2

    the first and second derivatives of R entering the sensitivity equations are bounded by a ρ-square-integrable envelope. Then µ ↦ S_µ ∈ H_ρ is C². Moreover, sup_µ ‖∂²_µµ S_µ‖_ρ < ∞. Consequently, for µ(s) = µ_c + hs, d²/ds² S_µ(s) = h² ∂²_µµ S_µ(s), so K_S ≲ |h|². Proof: for each admissible input a, the implicit function theorem in Banach spaces ensures the local C²-re...