pith. machine review for the scientific record.

arxiv: 2605.14546 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:59 UTC · model grok-4.3

classification 💻 cs.LG
keywords neural operators · PDE surrogate modeling · weight space directions · model merging · physical parameter transfer · fine-tuning · extrapolation · Navier-Stokes

The pith

Fine-tuning a shared neural PDE operator into low- and high-regime endpoint experts reveals a reusable physical direction in weight space for training-free regime composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that weight updates from fine-tuning a shared neural operator to low- and high-regime endpoints decompose into a family-shared adaptation component plus a direction aligned with the underlying physical parameter. This decomposition reframes the endpoints as finite-difference probes along that direction, which explains why naive averaging produces usable intermediates yet weakens regime-specific physics. The authors introduce Calibration-Conditioned Merge to read out a target coordinate from physical metadata or a short rollout prefix and produce a single merged checkpoint. If the separation holds, practitioners gain a post-hoc method to transfer across PDE regimes without retraining, with the largest reported gains appearing in extrapolation settings. The approach is validated on reaction-diffusion, viscosity-parameterized Navier-Stokes, and radial dam-break systems across multiple operator scales.

Core claim

Starting from a shared family anchor, fine-tuning to low- and high-regime endpoints separates the resulting weight updates into a family-shared adaptation and a direction aligned with the physical parameter. Endpoint experts therefore function as finite-difference probes of a local physical direction in weight space. This perspective motivates Calibration-Conditioned Merge, which infers a composition coordinate from physical metadata, a calibrated mapping, or a short observed rollout prefix and deploys a single merged checkpoint for the remaining rollout. On the evaluated benchmarks the method reduces out-of-distribution rollout error relative to the family anchor by 54.2, 42.8, and 13.8 percent on the reaction-diffusion, Navier-Stokes, and radial dam-break families, respectively.
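
Made concrete, with one hedge: the ±1 endpoint convention below is an editorial assumption, not notation the paper fixes in the excerpts shown here. Reading the decomposition from Figure 2:

```latex
% Endpoint residuals combined into a shared part and a signed direction
% (assuming the low/high endpoints sit at alpha = -1 and +1):
\Delta_{+} = \tfrac{1}{2}\left[(\theta_{\mathrm{high}}-\theta_0) + (\theta_{\mathrm{low}}-\theta_0)\right], \quad
\Delta_{-} = \tfrac{1}{2}\left[(\theta_{\mathrm{high}}-\theta_0) - (\theta_{\mathrm{low}}-\theta_0)\right], \quad
\hat{\theta} = \theta_0 + \Delta_{+} + \hat{\alpha}\,\Delta_{-}.
```

Under this reading the endpoints are central finite-difference probes at α = ∓1, and the plain endpoint average equals θ0 + ∆+: it keeps the shared adaptation but zeroes out ∆−, which is exactly why static averaging produces usable intermediates while attenuating regime-specific physics.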

What carries the argument

Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout that composes neural PDE experts along the discovered physical direction in weight space using metadata or a rollout prefix.
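
A minimal sketch of that composition rule in code. The function name merge_ccm, the plain NumPy state dicts, and the α̂ = −1/+1 endpoint convention are assumptions of this sketch, not the authors' implementation:

```python
# Minimal sketch of the CCM composition rule from Figure 2:
#   theta_hat = theta_0 + delta_plus + alpha_hat * delta_minus
# Checkpoints are modeled as dicts of NumPy arrays; the +/-1 endpoint
# convention is an assumption of this sketch, not stated by the paper.
import numpy as np

def merge_ccm(theta_0, theta_low, theta_high, alpha_hat):
    """Compose one merged checkpoint along the endpoint-defined physical direction."""
    merged = {}
    for name in theta_0:
        res_low = theta_low[name] - theta_0[name]    # endpoint-anchor residual, low regime
        res_high = theta_high[name] - theta_0[name]  # endpoint-anchor residual, high regime
        delta_plus = 0.5 * (res_high + res_low)      # family-shared adaptation
        delta_minus = 0.5 * (res_high - res_low)     # signed physical direction
        merged[name] = theta_0[name] + delta_plus + alpha_hat * delta_minus
    return merged
```

With this convention α̂ = 0 reproduces the static average of the two experts, so the direction-aware merge strictly generalizes averaging.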

If this is right

  • Static averaging of endpoint experts attenuates regime-specific physics and yields higher error than direction-aware merging.
  • A single merged checkpoint suffices for the full rollout once the composition coordinate is inferred from metadata or a short prefix (a prefix-selection sketch follows this list).
  • Error reductions are largest in extrapolative regimes lying outside the fine-tuned endpoints.
  • The physical direction remains consistent when the underlying operator is scaled or replaced by a DPOT-style backbone.
  • Endpoint fine-tuning produces reusable structure rather than isolated regime experts.
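
One plausible realization of the prefix readout (CCM-Prefix) is a small grid search over candidate coordinates, scoring each merged checkpoint against the observed prefix. This sketch reuses merge_ccm from above; rollout_fn, the grid range, and the L2 scoring are hypothetical stand-ins rather than the paper's specified readout:

```python
# Hedged sketch of prefix-based coordinate selection (CCM-Prefix).
# rollout_fn is a hypothetical runner: rollout_fn(theta, u0, n_steps) -> states.
import numpy as np

def select_alpha_from_prefix(theta_0, theta_low, theta_high,
                             rollout_fn, prefix, n_grid=25):
    """Return the alpha whose merged checkpoint best reproduces an observed prefix."""
    u0, target = prefix[0], np.asarray(prefix[1:])
    best_alpha, best_err = 0.0, float("inf")
    # Allow mild extrapolation past the +/-1 endpoints, since the paper reports
    # its largest gains in extrapolative regimes.
    for alpha in np.linspace(-1.5, 1.5, n_grid):
        theta = merge_ccm(theta_0, theta_low, theta_high, alpha)
        pred = rollout_fn(theta, u0, n_steps=len(target))
        err = float(np.mean((np.asarray(pred) - target) ** 2))  # prefix L2 error
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha  # deploy merge_ccm(..., best_alpha) for the remaining rollout
```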

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the direction proves approximately linear, the same separation could support zero-shot adaptation to continuous physical parameters never seen during fine-tuning.
  • Analogous directions may exist in other continuous-attribute domains such as vision models conditioned on scale or physics-informed language models.
  • Extending the approach to three-dimensional or coupled multiphysics systems would test whether the separation survives increased complexity.

Load-bearing premise

The observed separation of weight updates into a family-shared part and a physical-parameter-aligned direction is stable across fine-tuning procedures and generalizes beyond the tested regimes.

What would settle it

If the vector difference between high- and low-regime fine-tuned weights, after removal of the shared adaptation component, fails to produce accurate merged predictions for an unseen intermediate physical parameter when used as the CCM direction, while independent retraining succeeds.

Figures

Figures reproduced from arXiv: 2605.14546 by Dong Ni, Guanyu Chen, Pengkai Wang, Pengwei Liu, Qixin Zhang, Xiaolong Li, Xingyu Ren, Yuanyi Wang, Yuting Kong, Zhongkai Hao.

Figure 1. Endpoint residuals encode physical coordinates. Decomposing endpoint–anchor updates isolates a shared solver adaptation and a signed physical direction. This learned direction can align with physical coordinates and support extrapolative rollouts. Reversed-ordering controls test whether these gains arise from physical orientation rather than arbitrary interpolation. Adaptation across physical regimes remai…
Figure 2. Calibration-Conditioned Merge framework. A shared family anchor θ0 is trained on support regimes and fine-tuned into low/high endpoint experts. Their endpoint–anchor residuals define a shared adaptation ∆+ and a signed physical direction ∆−. CCM selects a target coordinate α̂ from metadata, scale calibration, or a short rollout prefix, and instantiates one checkpoint θ̂ = θ0 + ∆+ + α̂∆− for rollout.
Figure 3. Shared and signed endpoint structure on controlled DiffReact axes. Left: endpoint averaging is effective on merge-friendly axes, indicating a reusable shared component ∆+. Right: the f3 curve is minimized at α = 0, whereas the harder f2 medium-gap setting prefers a nonzero signed family direction ∆−.
Figure 4. Coordinate law and endpoint-line smoothness across PDE families. Panels A, C, and E compare the normalized physical coordinate against the corresponding diagnostic oracle-derived coordinate. Panels B, D, and F display the normalized excess loss evaluated along the same endpoint-defined coordinate axis. DiffReact and NS2D show strong coordinate alignment, whereas RDB exhibits weaker alignme…
Figure 5. Cross-domain evidence atlas. A–B show coordinate alignment; C, RDB prefix selection; D, wrong-sign coordinate control; E, matched-seed independent reruns; F–H, rollout error comparisons; I, auxiliary diagnostic reductions; J, the FNO scale trend. Matched-seed reruns retain the qualitative advantages of both main strategies: metadata-based coordinate selection on DiffReact and prefix-based selection on RDB.…
Figure 6. Large-FNO DiffReact f3 α sweep. (From Appendix E.4, DPOT-style backbone validation: the DPOT-style line evaluates the same composition rule outside the FNO architecture. The base reaches 0.0447, average merge improves to 0.0391, and CCM-Coord reaches 0.0183, matching the oracle on this small coordinate bank.)
Figure 7. NS2D prefix-calibration budget. (From Appendix E.7, RDB fixed alpha versus conditional composition: RDB is the least directly parameterized domain because the free-surface transient is not well described by a monotone scalar metadata coordinate. The main text therefore uses CCM-Prefix for this setting. The selected α varies by test sample and task, using the short prefix to choose the correct side of the endpoint direction…)
Figure 8. RDB fixed-α frontier versus conditional CCM-Prefix. (Embedded plot, RDB calibration budget: prefix steps 1–8 on the x-axis vs. future-only L2 on the y-axis, 0.1–0.5; series: Overall, Endpoint, Worst.)
Figure 9. RDB calibration-budget curve.
Figure 10. RDB task-level selected α. (From Appendix E.8, unified method ablation: the unified ablation compares static averaging, best fixed α, wrong fixed α, and conditional CCM on NS2D and RDB. It measures whether conditioning changes the selected point on the merge line beyond what one fixed interpolation coefficient can provide. Conditioning matters most when one fixed α cannot serve all regimes.)
Figure 11. Late-time DiffReact f2 medium-gap visualization. Rows correspond to four medium-gap Du tasks for the same seed. Columns show late rollout frames from t = 3.50 to t = 5.00; the u channel is shown for compactness.
Figure 12. Late-time NS2D medium-gap visualization. Rows correspond to medium-gap low/high viscosity tasks for the same seed (panel labels: MG-low ν = 4e-5, MG-high ν = 2.2e-4; NS2D viscosity family, seed 100128; vorticity snapshots). Columns show late rollout frames from t = 3.55 to t = 5.00.
Figure 13. Late-time RDB medium-gap visualization. Rows correspond to four medium-gap height tasks for the same seed. Columns show late rollout frames from t = 0.71 to t = 1.00.
Original abstract

Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction-diffusion system, viscosity-parameterized two-dimensional Navier-Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that fine-tuning a shared neural operator to low- and high-regime endpoints within a continuous physical family separates weight updates into a family-shared adaptation component and a direction aligned with the underlying physical parameter. This reinterpretation motivates the Calibration-Conditioned Merge (CCM) method, which uses physical metadata, a calibrated coordinate mapping, or a short rollout prefix to infer a composition coordinate and deploy a merged checkpoint for training-free transfer. Evaluations on reaction-diffusion, viscosity-parameterized 2D Navier-Stokes, and radial dam-break dynamics report out-of-distribution rollout error reductions of 54.2%, 42.8%, and 13.8% relative to the family anchor, with further ablations across FNO scales and a DPOT-style backbone.

Significance. If the claimed physical direction proves robust rather than procedure-dependent, the work could offer a principled mechanism for composing neural PDE experts along continuous physical parameters, enabling efficient extrapolation without retraining. The empirical gains in extrapolative regimes across three distinct benchmarks indicate potential practical value for scalable surrogate modeling, though the absence of controls for optimization details limits current confidence in the separation's stability.

major comments (2)
  1. [Abstract] The separation of updates into family-shared adaptation and physical direction is performed by taking the difference between low- and high-regime endpoint fine-tunings relative to the shared anchor (as described in the abstract). This vector is then treated as aligned with the physical parameter for CCM composition. For the claim to hold, the direction must be dominated by the parameter change and insensitive to optimization details, yet the abstract reports ablations only across backbones and scales with no explicit controls varying learning rate, step count, optimizer, or anchor perturbation.
  2. [Experiments] The reported error reductions (54.2%, 42.8%, 13.8%) are presented without error bars, details on data splits, or ablation controls for the CCM method versus the family anchor. This makes it difficult to assess whether the gains are statistically reliable or sensitive to the specific fine-tuning trajectories used to discover the direction.
minor comments (1)
  1. [Abstract] The abstract refers to 'a calibrated coordinate mapping' and a 'short observed rollout prefix' for inferring the composition coordinate, but the precise functional form of the readout, and how it is fitted from metadata or prefix data, is not specified as an equation in the provided summary.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the robustness of the claimed physical direction and the statistical presentation of results. We address each major comment below and commit to revisions that strengthen the evidence without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] The separation of updates into family-shared adaptation and physical direction is performed by taking the difference between low- and high-regime endpoint fine-tunings relative to the shared anchor (as described in the abstract). This vector is then treated as aligned with the physical parameter for CCM composition. For the claim to hold, the direction must be dominated by the parameter change and insensitive to optimization details, yet the abstract reports ablations only across backbones and scales with no explicit controls varying learning rate, step count, optimizer, or anchor perturbation.

    Authors: We agree that explicit controls for optimization details would strengthen the interpretation that the discovered direction is dominated by the physical parameter rather than the fine-tuning procedure. The existing ablations across FNO scales and a DPOT-style backbone already show consistency, but they do not vary learning rate, step count, optimizer, or anchor initialization. In the revised manuscript we will add a dedicated ablation subsection that systematically varies these factors on at least one benchmark and reports the resulting direction stability (measured by cosine similarity to the original direction and downstream CCM error; a sketch of this check follows the responses). revision: yes

  2. Referee: [Experiments] The reported error reductions (54.2%, 42.8%, 13.8%) are presented without error bars, details on data splits, or ablation controls for the CCM method versus the family anchor. This makes it difficult to assess whether the gains are statistically reliable or sensitive to the specific fine-tuning trajectories used to discover the direction.

    Authors: We acknowledge that the current manuscript lacks error bars, explicit data-split descriptions, and direct statistical comparisons of CCM against the family anchor. In the revision we will (i) recompute all reported rollout errors over at least five independent random seeds and include standard-error bars, (ii) add a table specifying train/validation/test splits and trajectory counts for each benchmark, and (iii) include an ablation table that directly contrasts CCM against the anchor with paired statistical tests (e.g., Wilcoxon signed-rank; sketched after these responses) to quantify the significance of the observed gains. revision: yes
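
For the direction-stability check promised in response 1, a minimal sketch; flatten_direction, direction_cosine, and the per-run ∆− dicts are illustrative names, not code from the paper:

```python
# Sketch of the direction-stability check: flatten each rerun's signed physical
# direction (a dict of parameter arrays, as in the merge sketches above) and
# compare runs by cosine similarity.
import numpy as np

def flatten_direction(delta):
    """Concatenate all parameter tensors of a weight-space direction into one vector."""
    return np.concatenate([np.ravel(delta[name]) for name in sorted(delta)])

def direction_cosine(delta_a, delta_b):
    """Cosine similarity between two weight-space directions (1.0 = same orientation)."""
    va, vb = flatten_direction(delta_a), flatten_direction(delta_b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```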
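
For the paired test in response 2, a sketch using SciPy's Wilcoxon signed-rank implementation; the error arrays are synthetic placeholders, not results from the paper:

```python
# Sketch of the committed paired test: per-trajectory rollout errors for the
# family anchor vs. the CCM checkpoint, compared with a Wilcoxon signed-rank
# test. The arrays below are synthetic placeholders for illustration only.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
anchor_err = rng.gamma(2.0, 0.05, size=40)             # placeholder anchor errors
ccm_err = anchor_err * rng.uniform(0.4, 0.9, size=40)  # placeholder CCM errors
# H1: anchor errors are stochastically greater than matched CCM errors.
stat, p_value = wilcoxon(anchor_err, ccm_err, alternative="greater")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.3g}")
```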

Circularity Check

0 steps flagged

No significant circularity: empirical discovery of weight-space directions remains self-contained

Full rationale

The paper presents an empirical procedure: fine-tune endpoint experts from a shared anchor, observe that their difference vector separates family-shared adaptation from a direction that empirically aligns with the physical parameter, then deploy CCM as a post-hoc linear composition along that observed vector. No equation or derivation reduces the claimed physical direction to a fitted quantity defined from the same evaluation data by construction; the alignment is tested via rollout error on held-out and extrapolative regimes rather than being tautological. Self-citations are not load-bearing for the central claim, and no ansatz or uniqueness theorem is smuggled in to force the result. The method is therefore a coordinate readout on an independently observed vector, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text. The physical direction is presented as discovered rather than postulated.

pith-pipeline@v0.9.0 · 5623 in / 1197 out tokens · 37317 ms · 2026-05-15T01:59:33.617795+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Fourier Neural Operator for Parametric Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020

  2. [2]

    Neural operators for accelerating scientific simulations and design

    Kamyar Azizzadenesheli, Nikola Kovachki, Zongyi Li, Miguel Liu-Schiaffini, Jean Kossaifi, and Anima Anandkumar. Neural operators for accelerating scientific simulations and design. Nature Reviews Physics, 6(5):320–328, 2024

  3. [3]

    Gnot: A general neural operator transformer for operator learning

    Zhongkai Hao, Zhengyi Wang, Hang Su, Chengyang Ying, Yinpeng Dong, Songming Liu, Ze Cheng, Jian Song, and Jun Zhu. Gnot: A general neural operator transformer for operator learning. In International Conference on Machine Learning, pages 12556–12569. PMLR, 2023

  4. [4]

    Laplace neural operator for solving differential equations

    Qianying Cao, Somdatta Goswami, and George Em Karniadakis. Laplace neural operator for solving differential equations. Nature Machine Intelligence, 6(6):631–640, 2024

  5. [5]

    Neural Operator: Graph Kernel Network for Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020

  6. [6]

    Alias-free mamba neural operator

    Jianwei Zheng, Wei Li, Ni Xu, Junwei Zhu, Xiaoxu Lin, and Xiaoqin Zhang. Alias-free mamba neural operator. Advances in Neural Information Processing Systems, 37:52962–52995, 2024

  7. [7]

    Aerogto: An efficient graph-transformer operator for learning large-scale aerodynamics of 3d vehicle geometries

    Pengwei Liu, Pengkai Wang, Xingyu Ren, Hangjie Yuan, Zhongkai Hao, Chao Xu, Shengze Cai, and Dong Ni. Aerogto: An efficient graph-transformer operator for learning large-scale aerodynamics of 3d vehicle geometries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18924–18932, 2025

  8. [8]

    An efficient graph-transformer operator for learning physical dynamics with manifolds embedding

    Pengwei Liu, Xingyu Ren, Pengkai Wang, Hangjie Yuan, Zhongkai Hao, Guanyu Chen, Chao Xu, Dong Ni, and Shengze Cai. An efficient graph-transformer operator for learning physical dynamics with manifolds embedding. arXiv preprint arXiv:2512.10227, 2025

  9. [9]

    Foundation neural operators: A survey on pretraining methods, the data ecosystem, and efficient adaptation

    Xingyu Ren, Pengkai Wang, Pengwei Liu, Xihang Yue, Huanshuo Dong, Zhenxin Huang, Zhongkai Hao, Ziqian Hu, Zhen Huang, Yian Wang, et al. Foundation neural operators: A survey on pretraining methods, the data ecosystem, and efficient adaptation. 2026

  10. [10]

    Uncertainty-informed meta pseudo labeling for surrogate modeling with limited labeled data

    Xingyu Ren, Pengwei Liu, Pengkai Wang, Guanyu Chen, Qinxin Wu, and Dong Ni. Uncertainty-informed meta pseudo labeling for surrogate modeling with limited labeled data. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  11. [11]

    Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators

    Thorsten Kurth, Shashank Subramanian, Peter Harrington, Jaideep Pathak, Morteza Mardani, David Hall, Andrea Miele, Karthik Kashinath, and Anima Anandkumar. Fourcastnet: Accelerating global high-resolution weather forecasting using adaptive fourier neural operators. In Proceedings of the Platform for Advanced Scientific Computing Conference, pages 1–11, 2023

  12. [12]

    Learning skillful medium-range global weather forecasting

    Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023

  13. [13]

    Learning nonlinear operators via deeponet based on the universal approximation theorem of operators

    Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3):218–229, 2021

  14. [14]

    Neural operator: Learning maps between function spaces with applications to pdes

    Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023

  15. [15]

    Dpot: auto-regressive denoising operator transformer for large-scale pde pre-training

    Zhongkai Hao, Chang Su, Songming Liu, Julius Berner, Chengyang Ying, Hang Su, Anima Anandkumar, Jian Song, and Jun Zhu. Dpot: auto-regressive denoising operator transformer for large-scale pde pre-training. In Proceedings of the 41st International Conference on Machine Learning, pages 17616–17635, 2024

  16. [16]

    Poseidon: Efficient foundation models for pdes

    Maximilian Herde, Bogdan Raonić, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel De Bezenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024

  17. [17]

    Pretraining codomain attention neural operators for solving multiphysics pdes

    Ashiqur Rahman, Robert J George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics pdes. Advances in Neural Information Processing Systems, 37:104035–104064, 2024

  18. [18]

    Multiple physics pretraining for spatiotemporal surrogate models

    Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024

  19. [19]

    Walrus: A cross-domain foundation model for continuum dynamics

    Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze WK Wong, Hadi Sotoudeh, Alberto Bietti, et al. Walrus: A cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684, 2025

  20. [20]

    Neural general circulation models for weather and climate

    Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, et al. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066, 2024

  21. [21]

    A foundation model for the earth system

    Cristian Bodnar, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Anna Allen, Johannes Brandstetter, Patrick Garvan, Maik Riechert, Jonathan A Weyn, Haiyu Dong, et al. A foundation model for the earth system. Nature, 641(8065):1180–1187, 2025

  22. [22]

    Pdebench: An extensive benchmark for scientific machine learning

    Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022

  23. [23]

    The well: a large-scale collection of diverse physics simulations for machine learning

    Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina J Agocs, Miguel Beneitez, Marsha Berger, Blakesley Burkhart, Stuart B Dalziel, Drummond B Fielding, et al. The well: a large-scale collection of diverse physics simulations for machine learning. Advances in Neural Information Processing Systems, 37:44989–45037, 2024

  24. [24]

    Apebench: A benchmark for autoregressive neural emulators of pdes

    Felix Koehler, Simon Niedermayr, Rüdiger Westermann, and Nils Thuerey. Apebench: A benchmark for autoregressive neural emulators of pdes. Advances in Neural Information Processing Systems, 37:120252–120310, 2024

  25. [25]

    Learning mesh-based simulation with graph networks

    Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W Battaglia. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409, 2020

  26. [26]

    Magnet: Mesh agnostic neural pde solver

    Oussama Boussif, Yoshua Bengio, Loubna Benabbou, and Dan Assouline. Magnet: Mesh agnostic neural pde solver. Advances in Neural Information Processing Systems, 35:31972–31985, 2022

  27. [27]

    Deep transfer operator learning for partial differential equations under conditional shift

    Somdatta Goswami, Katiana Kontolati, Michael D Shields, and George Em Karniadakis. Deep transfer operator learning for partial differential equations under conditional shift. Nature Machine Intelligence, 4(12):1155–1164, 2022

  28. [28]

    Fourier neural operator with learned deformations for pdes on general geometries

    Zongyi Li, Daniel Zhengyu Huang, Burigede Liu, and Anima Anandkumar. Fourier neural operator with learned deformations for pdes on general geometries. Journal of Machine Learning Research, 24(388):1–26, 2023

  29. [29]

    Geometry-informed neural operator for large-scale 3d pdes

    Zongyi Li, Nikola Kovachki, Chris Choy, Boyi Li, Jean Kossaifi, Shourya Otta, Mohammad Amin Nabian, Maximilian Stadler, Christian Hundt, Kamyar Azizzadenesheli, et al. Geometry-informed neural operator for large-scale 3d pdes. Advances in Neural Information Processing Systems, 36:35836–35854, 2023

  30. [30]

    Group equivariant fourier neural operators for partial differential equations

    Jacob Helwig, Xuan Zhang, Cong Fu, Jerry Kurtin, Stephan Wojtowytsch, and Shuiwang Ji. Group equivariant fourier neural operators for partial differential equations. arXiv preprint arXiv:2306.05697, 2023

  31. [31]

    Latent neural operator for solving forward and inverse pde problems

    Tian Wang and Chuang Wang. Latent neural operator for solving forward and inverse pde problems. Advances in Neural Information Processing Systems, 37:33085–33107, 2024

  32. [32]

    A scalable framework for learning the geometry-dependent solution operators of partial differential equations

    Minglang Yin, Nicolas Charon, Ryan Brody, Lu Lu, Natalia Trayanova, and Mauro Maggioni. A scalable framework for learning the geometry-dependent solution operators of partial differential equations. Nature Computational Science, 4(12):928–940, 2024

  33. [33]

    Diffeomorphism neural operator for various domains and parameters of partial differential equations

    Zhiwei Zhao, Changqing Liu, Yingguang Li, Zhibin Chen, and Xu Liu. Diffeomorphism neural operator for various domains and parameters of partial differential equations. Communications Physics, 8(1):15, 2025

  34. [34]

    Transolver: A Fast Transformer Solver for PDEs on General Geometries

    Haixu Wu, Huakun Luo, Haowen Wang, Jianmin Wang, and Mingsheng Long. Transolver: A fast transformer solver for pdes on general geometries. arXiv preprint arXiv:2402.02366, 2024

  35. [35]

    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022

  36. [36]

    Merging models with fisher-weighted averaging

    Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022

  37. [37]

    Editing Models with Task Arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022

  38. [38]

    Ties-merging: Resolving interference when merging models

    Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36:7093–7115, 2023

  39. [39]

    Adamerging: Adaptive model merging for multi-task learning

    Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575, 2023

  40. [40]

    Language models are super mario: Absorbing abilities from homologous models as a free lunch

    Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning, 2024

  41. [41]

    Model Merging Scaling Laws in Large Language Models

    Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, and Hongxia Yang. Model merging scaling laws in large language models. arXiv preprint arXiv:2509.24244, 2025

  42. [42]

    Infifpo: Implicit model fusion via preference optimization in large language models

    Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, and Hongxia Yang. Infifpo: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878, 2025

  43. [43]

    Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion

    Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, and Hongxia Yang. Infigfusion: Graph-on-logits distillation via efficient gromov-wasserstein for model fusion. arXiv preprint arXiv:2505.13893, 2025

  44. [44]

    Multiwavelet-based operator learning for differential equations

    Gaurav Gupta, Xiongye Xiao, and Paul Bogdan. Multiwavelet-based operator learning for differential equations. Advances in Neural Information Processing Systems, 34:24048–24062, 2021

  45. [45]

    Choose a transformer: Fourier or Galerkin

    Shuhao Cao. Choose a transformer: Fourier or Galerkin. Advances in Neural Information Processing Systems, 34:24924–24940, 2021

  46. [46]

    Factorized fourier neural operators

    Alasdair Tran, Alexander Mathews, Lexing Xie, and Cheng Soon Ong. Factorized fourier neural operators. arXiv preprint arXiv:2111.13802, 2021

  47. [47]

    Message passing neural pde solvers

    Johannes Brandstetter, Daniel Worrall, and Max Welling. Message passing neural pde solvers. arXiv preprint arXiv:2202.03376, 2022

  48. [48]

    Mgno: Efficient parameterization of linear operators via multigrid

    Juncai He, Xinliang Liu, and Jinchao Xu. Mgno: Efficient parameterization of linear operators via multigrid. arXiv preprint arXiv:2310.19809, 2023

  49. [49]

    Accurate medium-range global weather forecasting with 3d neural networks

    Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023

  50. [50]

    End-to-end data-driven weather prediction

    Anna Allen, Stratis Markou, Will Tebbutt, James Requeima, Wessel P Bruinsma, Tom R Andersson, Michael Herzog, Nicholas D Lane, Matthew Chantry, J Scott Hosking, et al. End-to-end data-driven weather prediction. Nature, 641(8065):1172–1179, 2025

  51. [51]

    Mpg: An efficient multi-scale point-based gnn for non-uniform meshes

    Qinxin Wu, Pengwei Liu, Xingyu Ren, and Dong Ni. Mpg: An efficient multi-scale point-based gnn for non-uniform meshes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3–18. Springer, 2025

  52. [52]

    In-context operator learning with data prompts for differential equation problems

    Liu Yang, Siting Liu, Tingwei Meng, and Stanley J Osher. In-context operator learning with data prompts for differential equation problems. Proceedings of the National Academy of Sciences, 120(39):e2310142120, 2023

  53. [53]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  54. [54]

    Model fusion via optimal transport

    Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020

  55. [55]

    Git re-basin: Merging models modulo permutation symmetries

    Samuel K Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022

  56. [56]

    Task arithmetic in the tangent space: Improved editing of pre-trained models

    Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. Advances in Neural Information Processing Systems, 36:66727–66754, 2023

  57. [57]

    Emr-merging: Tuning-free high-performance model merging

    Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, and Wanli Ouyang. Emr-merging: Tuning-free high-performance model merging. Advances in Neural Information Processing Systems, 37:122741–122769, 2024

  58. [58]

    Calm: Consensus-aware localized merging for multi-task learning

    Kunda Yan, Min Zhang, Sen Cui, Zikun Qu, Bo Jiang, Feng Liu, and Changshui Zhang. Calm: Consensus-aware localized merging for multi-task learning. arXiv preprint arXiv:2506.13406, 2025

  59. [59]

    Map: Low-compute model merging with amortized pareto fronts via quadratic approximation

    Lu Li, Tianyu Zhang, Zhiqi Bu, Suyuchen Wang, Huan He, Jie Fu, Yonghui Wu, Jiang Bian, Yong Chen, and Yoshua Bengio. Map: Low-compute model merging with amortized pareto fronts via quadratic approximation. arXiv preprint arXiv:2406.07529, 2024

  60. [60]

    there is a solution u(a, µ) with R(u(a, µ), a, µ) = 0; 3. D_u R(u(a, µ), a, µ) : U → Z is invertible with uniformly bounded inverse

  61. [61]

    Complete prefix L2

    the first and second derivatives of R entering the sensitivity equations are bounded by a ρ-square-integrable envelope. Then µ ↦ S_µ ∈ H_ρ is C². Moreover, sup_µ ‖∂²_µµ S_µ‖_ρ < ∞. Consequently, for µ(s) = µ_c + hs, d²/ds² S_µ(s) = h² ∂²_µµ S_µ(s), so K_S ≲ |h|². Proof: for each admissible input a, the implicit function theorem in Banach spaces ensures the local C²-re...