PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Aditi S. Krishnapriyan; Divyam Goel; Nithin Chalapathi; Sanjeev Raja

arxiv: 2605.25353 · v2 · pith:LLE7VJDWnew · submitted 2026-05-25 · 💻 cs.LG · cs.CV · physics.comp-ph

Pith reviewed 2026-06-29 22:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · physics.comp-ph

keywords: PDE inverse problems, neural networks, benchmark dataset, parameter estimation, test-time training, physics-informed learning, out-of-distribution evaluation, design space exploration

The pith

Neural networks recover PDE parameters most accurately with two-stage training on parameters then PDE residual fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates PDEInvBench, a dataset of numerical simulations spanning time-dependent and time-independent PDEs, to enable systematic study of neural networks that map observed solution fields back to the underlying physical parameters. It tests three axes of design: optimization procedures that mix supervised, self-supervised, and test-time objectives; architectural choices including conditioning and inductive biases; and scaling in model size and data volume. Experiments identify a consistent ranking in which an initial supervised phase on parameters followed by test-time adjustment using the PDE residual outperforms other schedules, derivative-augmented inputs raise accuracy across architectures, and broadening the variety of initial conditions in training data improves results more than widening the parameter ranges. These findings matter because inverse parameter estimation from field data is a recurring task in physics, engineering, and scientific computing where labeled parameter data is scarce.

Core claim

The central claim is that a two-stage procedure—supervised pre-training on PDE parameters followed by test-time fine-tuning that minimizes the PDE residual—combined with derivative features as inputs and training data that emphasizes initial-condition diversity, produces the highest accuracy for neural networks solving PDE inverse problems on both in-distribution and out-of-distribution splits of the new benchmark.

What carries the argument

PDEInvBench dataset of simulated solution fields paired with parameters, equipped with in- and out-of-distribution evaluation splits, used to benchmark neural network performance across optimization, representation, and scaling choices.

If this is right

Models trained under the two-stage schedule should be adopted as the default baseline for neural PDE parameter estimation.
Input channels should routinely include spatial and temporal derivatives of the observed fields.
Data collection efforts should allocate more resources to sampling varied initial conditions than to sampling wider parameter intervals.
Test-time residual fine-tuning can be expected to close a measurable fraction of the gap between supervised performance and the theoretical optimum.
Scaling laws for these tasks will show larger returns from data diversity than from parameter-range expansion.
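As a rough illustration of what derivative-augmented inputs could look like, the sketch below stacks finite-difference derivatives of a field as extra input channels. It assumes a 2D array of shape (time, space) on a uniform, spatially periodic grid. The function name, the channel order, and the choice of stencils are assumptions made here, not details taken from the paper.

```python
import numpy as np

def add_derivative_channels(u, dx, dt):
    """Stack u with finite-difference estimates of u_t, u_x, and u_xx.

    u  : array of shape (T, X), periodic along the spatial axis
    dx : spatial grid spacing
    dt : time step between frames
    Returns an array of shape (4, T, X) ordered as [u, u_t, u_x, u_xx].
    """
    u_t = np.gradient(u, dt, axis=0)  # central in the interior, one-sided at the ends
    u_x = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / (2 * dx)
    u_xx = (np.roll(u, -1, axis=1) - 2 * u + np.roll(u, 1, axis=1)) / dx ** 2
    return np.stack([u, u_t, u_x, u_xx], axis=0)
```

Finite-difference stencils amplify high-frequency noise, which is one plausible reason a model fed derivative channels could be more sensitive to corrupted observations.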

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage recipe may transfer to inverse problems outside PDEs, such as recovering coefficients in integral equations or stochastic processes.
The benchmark could be used to test whether the observed ranking of methods persists when measurement noise levels or missing data patterns match those in actual experiments.
Practitioners could combine the dataset with transfer-learning protocols to initialize models for new PDE families without regenerating large simulation libraries.

Load-bearing premise

The numerical simulations that generate the dataset correctly capture the true solution behavior of the PDEs under the selected parameters, initial conditions, and boundary conditions.

What would settle it

A controlled experiment on real laboratory or field measurements of a PDE-governed system in which the recommended two-stage training plus derivative inputs fails to outperform plain supervised training on held-out parameter recovery error.
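The comparison described here comes down to a parameter recovery error on held-out data. A minimal sketch follows, assuming an elementwise relative error averaged over samples. The exact metric definition used in the paper is not given in this summary, so both helper names and the `eps` guard are hypothetical.

```python
import numpy as np

def relative_error(pred, true, eps=1e-12):
    """Elementwise |pred - true| / |true|, with a small guard against division by zero."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.abs(pred - true) / (np.abs(true) + eps)

def method_beats_baseline(pred_method, pred_baseline, true):
    """True when the method's mean relative error is below the baseline's."""
    return bool(relative_error(pred_method, true).mean()
                < relative_error(pred_baseline, true).mean())
```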

Figures

Figures reproduced from arXiv: 2605.25353 by Aditi S. Krishnapriyan, Divyam Goel, Nithin Chalapathi, Sanjeev Raja.

**Figure 1.** Overview of design investigation of neural networks for PDE inverse problems. (A) We consider the inverse problem setting in which a neural network learns to map from PDE solution fields u(ϕ) to predicted PDE parameters ϕ̂, using a dataset D consisting of parameter-solution pairs as supervision. (B) Our investigation is split across three key design axes: optimization procedures, problem representation, an…

**Figure 2.** Optimization approaches for NOs in PDE inverse problems: (A–C) The performance of FNO on purely data-driven supervision versus self-supervision using only the PDE residual. For virtually all systems and evaluation settings, purely data-driven supervision consistently outperforms self-supervision using the PDE residual. As expected, performance degrades on the OOD Non-Extreme and OOD Extreme splits. (D–E) C…

**Figure 3.** Impact of problem representation on PDE inverse problem performance. (A–C) Comparison of architectural inductive biases (FNO, ResNet, scOT, DeepONet) across evaluation splits. FNO generally outperforms ResNet, scOT and DeepONet on time-dependent PDEs, particularly in In-Distribution and Out-of-Distribution (Non-Extreme) regimes. Results are also shown in tabular form in Tables 6, 7, and 8. (D–F) Effect of …

**Figure 4.** Dataset scaling with the FNO architecture: (A–C) Effect of initial condition scaling on FNO across evaluated systems. Increasing the percent of initial conditions used for training improves performance across all systems and evaluation settings. (D–F) FNO’s sensitivity to initial condition scaling (while using 100% of available PDE parameters) versus scaling the number of generated PDE parameters (while us…

**Figure 5.** Energy spectra convergence of 2D Navier-Stokes Forced. (Left) Energy spectra for ν = 5e-3. Region a: inertial range, with rough adherence to the k^-5 Kolmogorov power law. Region b: dissipative region where energy leaves the system. (Right) Energy spectra for ν = 1e-5. (Both) The blue bounding box highlights convergence behavior, characterized by a steep drop-off in energy spectra into minor oscillatio…

**Figure 6.** Parameter partitioning for 2D RD. The parameter space for 2D RD forms a cube with each dimension corresponding to possible values along a parameter. The inner cube (orange) with each edge covering the middle 16% of parameter values corresponds to the OOD (Non-Extreme) split. The middle cube (green) with each edge covering 32% of parameters corresponds to the ID setting. The corner cubes (red) with edge le…

**Figure 7.** Effect of PDE residual loss weights in the PINNs loss. Relative error from the best performing models on the validation set across three scenarios: ID (left), OOD Non-Extreme (middle), and OOD Extreme (right). The x-axis represents PDE residual term weight ranging from 0 to 1 on a logarithmic scale. Joint training with the PDE residual offers no significant improvements over direct parameter supervision, a…

**Figure 8.** Effect of test time tailoring with different anchor loss weights per batch. Using an anchor loss weight

**Figure 9.** Comparing per-element vs per-batch test time training. Both settings generally perform the same across systems and evaluation settings.

**Figure 10.** Evaluating test-time training with varying quantities of initial conditions during training time and

**Figure 11.** Varying temporal conditioning frames. Comparison of inverse problem performance with increasing number of temporal frames across evaluation settings for all time-dependent settings. There is no consistent trend in scaling the conditioning PDE frames and inverse problem performance for FNO. The lack of consistent improvement with more frames may reflect redundancy in temporal information beyond a certain p…

**Figure 12.** Inverse problem performance as a function of model size. Performance curves for different PDE

**Figure 13.** Effect of initial condition scaling across architectures. Evaluating data efficiency of different architectures to scaling the number of initial conditions. Increasing the total number of initial conditions during training improves performance for all architectures, with FNO being the most data efficient. Increasing the number of parameters generally leads to better performance in all evaluation settings f…

**Figure 14.** Effect of scaling the total number of generated PDE parameters. Evaluating inverse problem performance on different quantities of available data by scaling the total number of generated parameter settings of the PDEs. Increasing the number of training trajectories along generated PDE parameters improves test time performance on unseen parameters. (Panel labels: NS Unforced, NS Forced, KdV, RD - k, RD - Du, DF.)

**Figure 15.** Effect of PDE parameter scaling across architectures. Evaluating data efficiency of different architectures to scaling the number of PDE parameters. Increasing the total number of initial conditions during training improves performance for all architectures with FNO being the most data efficient. … the temporal horizon during training does not inherently improve generalization capability when estimating PDE …

**Figure 16.** Effect of increasing the ground truth time horizon. Evaluating inverse problem performance on different quantities of available data by scaling the total time horizon of the training solution fields. Increasing the total time horizon of training trajectories improves performance on held out future time frames. In the reference simulations, more viscous flows (higher ν / lower Reynolds number) exhibit an e…

**Figure 17.** Energy spectra of predicted Forced 2D Navier Stokes vs. reference solutions. Energy spectra corresponding to predictions are in solid lines, while references are dashed. Armed with the insight that removing partial derivative conditioning improves the robustness of the models to degradation (…

**Figure 18.** Impact of noisy inputs on models trained on non-noisy solution fields. Evaluating models trained on clean inputs across all design axes under varying levels of salt-and-pepper noise and Butterworth filtering. Models trained without partial-derivative inputs exhibit improved robustness to these degradations. … learned inverse mapping is insensitive to the exact magnitude of spectral corruption, provided the …

**Figure 20.** As expected, higher drop probabilities

**Figure 19.** Evaluation of FNO robustness after training under observational degradation. Heatmaps show relative error as a function of train-time (rows) and test-time (columns) degradation levels for salt-and-pepper (top) and Butterworth (bottom) corruption across three PDE systems. Salt-and-pepper noise induces a strong diagonal structure, indicating sensitivity to mismatch between training and inference corruption …

**Figure 20.** Impact of Non-Uniform Grids on PDE inverse problem performance. Comparison of architectural inductive biases (FNO, ResNet, scOT) to varying degrees of non-uniformity in solution fields. D.7 Benchmark Results: We provide the relative errors of the various models across all systems and settings in a table format.

Original abstract

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.
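The setup in the abstract can be written compactly. In the block below, u(ϕ), ϕ̂, and D follow the Figure 1 caption. The network f_θ, the PDE operator 𝓕 (with 𝓕(u; ϕ) = 0 for a true solution), and the squared norms are notation assumed here for illustration, not necessarily the paper's.

```latex
\begin{align}
  \hat{\phi} &= f_\theta\bigl(u(\phi)\bigr), \\
  \mathcal{L}_{\mathrm{sup}}(\theta)
    &= \mathbb{E}_{(\phi,\,u) \sim D}\,
       \bigl\| f_\theta(u) - \phi \bigr\|^2, \\
  \mathcal{L}_{\mathrm{res}}(\theta;\, u)
    &= \bigl\| \mathcal{F}\bigl(u;\, f_\theta(u)\bigr) \bigr\|^2 .
\end{align}
```

Under this notation, the two-stage procedure first minimizes the supervised loss over the training set D, then at test time minimizes the residual loss on each observed field u, which needs no parameter labels.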

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PDEInvBench ships a public dataset for PDE inverse problems and some practical training notes, but the simulation quality behind the data is not shown to be solid.

PDEInvBench introduces a dataset of numerical PDE simulations aimed at inverse problems, mapping solution fields back to parameters, plus controlled runs on training methods, input features, and data scaling. It also supplies in- and out-of-distribution splits and releases both data and code.

The work is new in its focus on the inverse direction and in testing the three specific claims: two-stage training (supervised then residual fine-tuning) works best, derivative inputs help, and initial-condition variety beats expanding the parameter range. Those are concrete, testable observations that practitioners can check.

The main soft spot is the numerical fidelity of the underlying simulations. The abstract and dataset description give no convergence studies, grid-refinement results, or cross-solver checks. Without those, the reported performance deltas could partly reflect discretization artifacts instead of genuine inverse-problem behavior, which weakens how much weight the three insights can carry.

This paper is aimed at people already working on machine learning for PDEs who want a shared testbed for parameter estimation. A reader who needs data for inverse tasks can get immediate use from the released artifacts even if they rerun some experiments themselves. It deserves a serious referee because the benchmark contribution is real and the community can iterate on it once the simulation quality is clarified in review.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces PDEInvBench, a benchmark dataset of numerical simulations for PDE inverse problems (time-dependent and time-independent) with in- and out-of-distribution splits. It explores neural network design along three axes—optimization (supervised/self-supervised/test-time objectives), representations (architectures and conditioning), and scaling (model/data size)—and reports three empirical insights: two-stage training (parameter supervision followed by PDE-residual test-time fine-tuning) is best, PDE derivatives as input features improve accuracy, and diverse initial conditions yield larger gains than wider PDE parameter ranges. Dataset and codebase are released publicly.

Significance. If the underlying simulations are faithful, the work supplies a needed public benchmark for PDE inverse problems (a gap relative to forward-problem benchmarks) and supplies concrete, actionable guidance on training procedures, input features, and data diversity. The public release of data and code is a clear strength that supports reproducibility and follow-on research.

major comments (1)

[Abstract / dataset-construction paragraph] The performance deltas supporting the three listed insights rest on numerical solution fields whose fidelity is not verified by convergence studies, grid-refinement checks, or cross-solver comparisons. Without such evidence the reported gains (two-stage training, derivative features, IC diversity) could partly reflect solver-specific discretization artifacts rather than genuine PDE-inverse behavior.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the fidelity of the numerical simulations. We address the single major comment below.

Referee: [Abstract / dataset-construction paragraph] The performance deltas supporting the three listed insights rest on numerical solution fields whose fidelity is not verified by convergence studies, grid-refinement checks, or cross-solver comparisons. Without such evidence the reported gains (two-stage training, derivative features, IC diversity) could partly reflect solver-specific discretization artifacts rather than genuine PDE-inverse behavior.

Authors: We agree that explicit verification of numerical fidelity is important for a benchmark dataset. The simulations were produced with standard, widely used solvers and discretizations drawn from established PDE literature, with resolutions chosen to match common practice for each equation. Because all neural-network variants were evaluated on identical simulation data, any fixed discretization artifacts affect every method equally; the reported relative gains (two-stage training, derivative inputs, IC diversity) therefore reflect differences in how the networks exploit the data rather than solver-specific effects. Nevertheless, to strengthen the manuscript we will add a new subsection to the dataset-construction section that includes grid-refinement studies and limited cross-solver comparisons for representative PDEs, confirming that the chosen discretizations are in the convergent regime.

Revision: yes
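For readers unfamiliar with grid-refinement studies, the sketch below shows the basic idea on a deliberately simple case: it measures the error of a second-order central-difference approximation to u_xx for u = sin(x) at successively doubled resolutions and estimates the observed order of accuracy. The test function, stencil, and function names are assumptions for illustration and say nothing about the solvers actually used to build the dataset.

```python
import numpy as np

def second_derivative_error(nx):
    """Max error of the central-difference u_xx for u = sin(x) on a periodic grid."""
    x = np.linspace(0.0, 2 * np.pi, nx, endpoint=False)
    dx = x[1] - x[0]
    u = np.sin(x)
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2
    return np.max(np.abs(u_xx + np.sin(x)))  # exact u_xx is -sin(x)

def observed_orders(resolutions):
    """Observed convergence order between consecutive grids (each twice as fine)."""
    errors = [second_derivative_error(n) for n in resolutions]
    return [float(np.log2(errors[i] / errors[i + 1])) for i in range(len(errors) - 1)]

orders = observed_orders([32, 64, 128])
print(orders)  # values close to 2 indicate the expected second-order convergence
```

An observed order that matches the scheme's nominal order as the grid is refined is the usual sign that a discretization is in its convergent regime.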

Circularity Check

0 steps flagged

Empirical benchmark study with no circular derivation chain

The paper introduces a synthetic dataset of PDE simulations and reports experimental results on neural network performance for inverse problems. The three main insights (two-stage training, derivative features, IC diversity) are direct empirical measurements on held-out splits of that dataset; no equations, fitted parameters, or predictions are defined in terms of themselves or reduced by construction to the training inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained as a benchmark exploration and does not claim any first-principles derivation that collapses to its own data generation procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the fidelity of the numerical PDE solver used to create the data and on standard supervised learning assumptions; no new physical axioms or invented entities are introduced.

Reference graph

Works this paper leans on

8 extracted references · 3 canonical work pages

[1]

URL https://openreview.net/forum?id=wNBARGxoJn

ISSN 2835-8856. URL https://openreview.net/forum?id=wNBARGxoJn. Y. Du and A. Krishnapriyan. Eddyformer: Accelerated neural simulations of three-dimensional turbulence at scale. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 103372–103403. Curran...

work page doi:10.1016/s0377-0427(99 2025
[2]

Solitons

doi: 10.1073/pnas.2101784118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2101784118. _eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.2101784118. G. Kohl, L.-W. Chen, and N. Thuerey. Benchmarking Autoregressive Conditional Diffusion Models for Turbulent Flow Simulation. arXiv, 2023. doi: 10.48550/arXiv.2309.01745. URL https://doi.org/10.48550/arXiv.2...

work page doi:10.1073/pnas.2101784118 2023
[3]

The downsampler output is flattened and fed into an MLP head with one hidden layer of 64 units using ReLU activation

Thus, the downsampler reduces the spatial resolution by a factor of 16. The downsampler output is flattened and fed into an MLP head with one hidden layer of 64 units using ReLU activation. A single value is returned, corresponding to the PDE parameter. DeepONet. Our DeepONet implementation follows the standard branch-trunk decomposition with slight change...

2026
[4]

Initial condition scaling: We vary the number of initial conditions per parameter value at {20%, 35%, 50%, 75%, and 100%} of the full dataset
[5]

Parameter scaling: We vary the density of parameter sampling at {20%, 35%, 50%, 75%, and 100%} of the full range
[6]

physics-informed

Temporal scaling: We vary the total temporal horizon on which the model is trained by varying the sampled frames from the first {10%, 20%, 50%, 75%} of the total generated temporal range of our dataset. The evaluation set is the final 25% of the generated temporal range for the in-distribution test setting and the entire temporal trajectory for the OOD ev...

2026
[7]

We find that the optimal setting uses equal weighting between residual and anchor terms

influences performance. We find that the optimal setting uses equal weighting between residual and anchor terms. When the anchor loss is weighted too lightly, the relative error tends to increase with training steps, indicating optimization instability. D.1.4 Per-element vs per-batch tailoring. We compare performing TTT on a per-element basis (batch size of ...
[8]

more is better

with anchor loss weights of 1 and show results in Figure 9. TTT per-batch generally performs the same as per-sample in all evaluation settings. D.1.5 Test Time Tailoring Comparison with varying levels of ICs. We compare the performance of test-time training on models trained on 20% of total available initial conditions and 100% of initial conditions by system in F...

work page arXiv 2026
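The initial-condition and parameter scaling protocols quoted in entries [4] and [5] amount to two different ways of subsampling a grid of (parameter, initial condition) pairs. The sketch below contrasts them under the assumption that the dataset can be indexed by integer parameter and initial-condition ids. The function names, the uniform random selection, and the seed handling are hypothetical, not taken from the released codebase.

```python
import numpy as np

def subsample_initial_conditions(n_params, n_ics, fraction, seed=0):
    """Keep every parameter value but only `fraction` of its initial conditions."""
    rng = np.random.default_rng(seed)
    keep = max(1, int(round(fraction * n_ics)))
    pairs = []
    for p in range(n_params):
        chosen = np.sort(rng.choice(n_ics, size=keep, replace=False))
        pairs.extend((p, int(i)) for i in chosen)
    return pairs

def subsample_parameters(n_params, n_ics, fraction, seed=0):
    """Keep only `fraction` of the parameter values, with all of their initial conditions."""
    rng = np.random.default_rng(seed)
    keep = max(1, int(round(fraction * n_params)))
    chosen = np.sort(rng.choice(n_params, size=keep, replace=False))
    return [(int(p), i) for p in chosen for i in range(n_ics)]
```

At the same fraction both schemes keep the same number of trajectories, so comparing models trained on each isolates which kind of diversity matters more.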
