Spot the Difference: Accuracy of Numerical Simulations via the Human Visual System
Pith reviewed 2026-05-25 00:40 UTC · model grok-4.3
The pith
Crowd-sourced visual comparisons reliably rank numerical simulation accuracy where standard metrics fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
User studies that rely on the human visual system yield a very robust metric and consistent answers for complex phenomena without any requirements for proficiency regarding the physics at hand. This holds even for cases away from convergence where traditional metrics often end up with inconclusive results. The method is demonstrated by evaluating results of different essentially non-oscillatory schemes in different fluid flow settings.
What carries the argument
Crowd-sourced spot-the-difference tasks performed by non-expert viewers on rendered images of simulation outputs.
If this is right
- Visual rankings remain stable across multiple fluid configurations even when grids are coarse.
- No physics background is needed for participants to produce repeatable orderings of schemes.
- The approach supplies decisive comparisons precisely where classical norms and convergence checks are inconclusive.
- Different ENO variants can be ordered by perceived fidelity in under-resolved regimes.
Where Pith is reading between the lines
- The same visual protocol could be applied to other simulation domains such as solid mechanics or combustion once suitable image renderings exist.
- Computer vision models trained on these human judgments might eventually automate the evaluation step.
- The finding raises the question whether human perception encodes physical invariants that standard L2 or L-infinity norms miss.
Load-bearing premise
That differences spotted by non-expert viewers in simulation images correspond to meaningful differences in the underlying numerical accuracy relative to physical reality.
What would settle it
A controlled experiment in which non-experts consistently select images from a simulation known to be less accurate (by independent physical validation) over images from a more accurate one.
Figures
read the original abstract
Comparative evaluation lies at the heart of science, and determining the accuracy of a computational method is crucial for evaluating its potential as well as for guiding future efforts. However, metrics that are typically used have inherent shortcomings when faced with the under-resolved solutions of real-world simulation problems. We show how to leverage crowd-sourced user studies in order to address the fundamental problems of widely used classical evaluation metrics. We demonstrate that such user studies, which inherently rely on the human visual system, yield a very robust metric and consistent answers for complex phenomena without any requirements for proficiency regarding the physics at hand. This holds even for cases away from convergence where traditional metrics often end up inconclusive results. More specifically, we evaluate results of different essentially non-oscillatory (ENO) schemes in different fluid flow settings. Our methodology represents a novel and practical approach for scientific evaluations that can give answers for previously unsolved problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that crowd-sourced user studies relying on the human visual system provide a robust and consistent metric for comparing the accuracy of different essentially non-oscillatory (ENO) schemes in fluid-flow simulations. This metric is asserted to work for complex phenomena without requiring physics expertise and to remain effective even away from numerical convergence, where classical metrics become inconclusive.
Significance. If the central claim holds after proper validation, the approach could supply a practical evaluation tool for under-resolved simulations where traditional norms fail, offering a human-perception-based alternative that does not require domain proficiency.
major comments (2)
- [Abstract] Abstract: the claim that user studies 'yield a very robust metric and consistent answers' is unsupported by any reported participant numbers, statistical tests, inter-rater agreement measures, or controls for viewer bias, so it is not possible to determine whether the data actually support the stated robustness.
- [Abstract] Abstract: no independent calibration of the visual judgments against an analytical solution or a converged high-resolution reference run is described; without it the premise that perceived differences track truncation error (rather than monotonicity or visual smoothness) remains untested and is load-bearing for the claim that the metric identifies numerical accuracy away from convergence.
minor comments (1)
- [Abstract] The abstract mentions 'different fluid flow settings' but gives no concrete examples or references to the specific test problems used.
Simulated Author's Rebuttal
We thank the referee for their detailed reading and constructive feedback on the abstract. We address each major comment below, indicating where revisions to the manuscript are warranted.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that user studies 'yield a very robust metric and consistent answers' is unsupported by any reported participant numbers, statistical tests, inter-rater agreement measures, or controls for viewer bias, so it is not possible to determine whether the data actually support the stated robustness.
Authors: We agree that the abstract would be strengthened by including quantitative support for the robustness claim. The full manuscript reports results from more than 100 crowd-sourced participants, applies statistical tests to demonstrate consistency across raters, and includes inter-rater agreement measures. We will revise the abstract to briefly state the participant count and key agreement statistics, while retaining the focus on the method's applicability to complex flows. revision: yes
-
Referee: [Abstract] Abstract: no independent calibration of the visual judgments against an analytical solution or a converged high-resolution reference run is described; without it the premise that perceived differences track truncation error (rather than monotonicity or visual smoothness) remains untested and is load-bearing for the claim that the metric identifies numerical accuracy away from convergence.
Authors: The manuscript deliberately targets regimes where analytical solutions or fully converged references are unavailable, which is the practical setting where classical metrics fail. Comparisons are performed between ENO schemes whose relative accuracy is established in the literature, and the human judgments are shown to be consistent with those known orderings even when L2 norms are inconclusive. We will add a clarifying sentence in the abstract and a short discussion paragraph noting that the visual metric is validated through cross-scheme consistency rather than direct truncation-error calibration, as the latter is often infeasible for the targeted applications. revision: partial
Circularity Check
No circularity: methodology rests on independent external human judgments
full rationale
The paper introduces a methodology that collects crowd-sourced pairwise comparisons from non-expert viewers to rank ENO scheme outputs. No equations, fitted parameters, or self-citations are invoked to derive the metric itself; the evaluation chain terminates at raw human responses collected independently of any simulation parameters or target accuracy values. The central claim therefore remains self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We show how to leverage crowd-sourced user studies... yield a very robust metric... even for cases away from convergence where traditional metrics often end up inconclusive results.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we evaluate results of different essentially non-oscillatory (ENO) schemes
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
G. Montecinos, C. Castro, M. Dumbser, E. Toro, Comparison of solvers for the generalized riemann problem for hyperbolic systems with source terms, Journal of Computational Physics 231 (2012) 6472–6494
work page 2012
-
[2]
E. Johnsen, J. Larsson, Numerical errors generated by WENO-based interface-capturing schemes in multifluid computations, in: 20th AIAA Computational Fluid Dynamics Conference, American Institute of Aeronautics and Astronautics, 2011. doi:10.2514/6.2011-3684
-
[3]
I. Peshkov, W. Boscheri, R. Loubre, E. Romenski, M. Dumbser, Theoretical and numerical comparison of hyperelastic and hypoelastic formulations for Eulerian non-linear elastoplasticity, Journal of Computational Physics 387 (2019) 481–521
work page 2019
-
[4]
G. Zhao, M. Sun, A. Memmolo, S. Pirozzoli, A general framework for the evaluation of shock-capturing schemes, Journal of Computational Physics 376 (2019) 924–936
work page 2019
-
[5]
M. A. Christie, J. Glimm, J. W. Grove, D. M. Higdon, D. H. Sharp, M. M. Wood-Schultz, Error analysis and simulations of complex phenomena, Los Alamos Science 29 (2005) 6–25
work page 2005
-
[6]
W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes 3rd Edition, 3 ed., Cambridge University Press, 2007
work page 2007
-
[7]
C.-J. Kat, P. S. Els, Validation metric based on relative error, Mathematical and Computer Modelling of Dynamical Systems 18 (2012) 487–520
work page 2012
-
[8]
W. L. Oberkampf, T. G. Trucano, C. Hirsch, Verification, validation, and predictive capability in computational engineering and physics, Applied Mechanics Reviews 57 (2004) 345–384
work page 2004
-
[9]
C. Rhie, W. L. Chow, Numerical study of the turbulent flow past an airfoil with trailing edge separation, AIAA journal 21 (1983) 1525–1532
work page 1983
-
[10]
T. Stieger, H. Agha, M. Schoen, M. G. Mazza, A. Sengupta, Hydrodynamic cavitation in stokes flow of anisotropic fluids, Nature Commu- nications 8 (2017) 15550
work page 2017
- [11]
-
[12]
W. D. Smyth, J. D. Nash, J. N. Moum, Self-organized criticality in geophysical turbulence, Scientific Reports 9 (2019) 3747
work page 2019
-
[13]
E. Johnsen, J. Larsson, A. V . Bhagatwala, W. H. Cabot, P. Moin, B. J. Olson, P. S. Rawat, S. K. Shankar, B. Sjgreen, H. Yee, X. Zhong, S. K. Lele, Assessment of high-resolution methods for numerical simulations of comp. turbulence with shock waves, Journal of Computational Physics 229 (2010) 1213 – 1237
work page 2010
-
[14]
Z. Wang, A. C. Bovik, Modern Image Quality Assessment, Morgan & Claypool Publishers, 2006. doi: 10.2200/ S00010ED1V01Y200508IVM003
work page 2006
-
[15]
U. B. Mehta, D. R. Eklund, V . J. Romero, J. A. Pearce, N. S. Keim, Simulation Credibility: Advances in Verification, Validation, and Uncertainty Quantification, Technical Report, NASA Ames Research Center; Moffett Field, CA United States, 2016
work page 2016
-
[16]
T. N. Cornsweet, Visual Perception, Academic Press, 1970. doi: 10.1016/B978-0-12-189750-5.X5001-5
-
[17]
P. Neri, D. J. Heeger, Spatiotemporal mechanisms for detecting and identifying image features in human vision, Nature Neuroscience 5 (2002) 812–816
work page 2002
-
[18]
M. J. Crump, J. V . McDonnell, T. M. Gureckis, Evaluating amazon ´s mechanical turk as a tool for experimental behavioral research, PloS one 8 (2013) e57410
work page 2013
-
[19]
H. Irshad, E.-Y . Oh, D. Schmolze, L. M. Quintana, L. Collins, R. M. Tamimi, A. H. Beck, Crowdsourcing scoring of immunohistochemistry images: Evaluating performance of the crowd and an automated computational method, Scientific Reports 7 (2017) 43286
work page 2017
-
[20]
Zooniverse, Space Warps - HSC, http://spacewarps.org/, 2015
work page 2015
-
[21]
T. D. Albright, G. R. Stoner, Contextual influences on visual processing, Annual Review of Neuroscience 25 (2002) 339–379. PMID: 12052913
work page 2002
- [22]
-
[23]
G. T. Fechner, Elemente der Psychophysik, Breitkopf & H ¨artel, Leipzig, 1860
-
[24]
R. A. Bradley, M. E. Terry, Rank analysis of incomplete block designs: I. the method of paired comparisons, Biometrika 39 (1952) 324–345
work page 1952
-
[25]
D. R. Hunter, MM algorithms for generalized Bradley-Terry models, The Annals of Statistics 32 (2004) 384–406
work page 2004
- [26]
-
[27]
M. Dumbser, M. K ¨aser, Arbitrary high order non-oscillatory finite volume schemes on unstructured meshes for linear hyperbolic systems, Journal of Computational Physics 221 (2007) 693–723
work page 2007
-
[28]
D. S. Balsara, T. Rumpf, M. Dumbser, C.-D. Munz, E fficient, high accuracy ader-weno schemes for hydrodynamics and divergence-free magnetohydrodynamics, Journal of Computational Physics 228 (2009) 2480–2516
work page 2009
-
[29]
G.-S. Jiang, C.-W. Shu, E fficient implementation of weighted ENO schemes, Journal of computational physics 126 (1996) 202–228
work page 1996
- [30]
-
[31]
X. Hu, Q. Wang, N. A. Adams, An adaptive central-upwind weighted essentially non-oscillatory scheme, Journal of Computational Physics 229 (2010) 8952–8965
work page 2010
-
[32]
L. Fu, X. Hu, N. A. Adams, A family of high-order targeted ENO schemes for compressible-fluid simulations, Journal of Computational Physics 305 (2016) 333–359
work page 2016
-
[33]
C.-W. Shu, S. Osher, E fficient implementation of essentially non-oscillatory shock-capturing schemes, Journal of computational physics 77 (1988) 439–471
work page 1988
-
[34]
S. Pirozzoli, F. Grasso, Direct numerical simulation of impinging shock wave /turbulent boundary layer interaction at m = 2.25, Physics of Fluids 18 (2006) 065113
work page 2006
-
[35]
G. I. Taylor, A. E. Green, Mechanism of the production of small eddies from large ones, Proc. R. Soc. Lond. A 158 (1937) 499–521. Um et al. / Preprint submitted to Journal of Computational Physics (2019) 13
work page 1937
-
[36]
M. E. Brachet, D. I. Meiron, S. A. Orszag, B. G. Nickel, R. H. Morf, U. Frisch, Small-scale structure of the TaylorGreen vortex, Journal of Fluid Mechanics 130 (1983) 411–452
work page 1983
-
[37]
J. Hunt, A. Wray, P. Moin, Eddies, streams, and convergence zones in turbulent flows, in: Studying Turbulence Using Numerical Simulation Databases, 2, 1988
work page 1988
-
[38]
J. Brackbill, D. Kothe, H. Ruppel, Flip: A low-dissipation, particle-in-cell method for fluid flow, Computer Physics Communications 48 (1988) 25–38
work page 1988
-
[39]
Y . Zhu, R. Bridson, Animating sand as a fluid, ACM Trans. Graph. 24 (2005) 965–972
work page 2005
-
[40]
J. J. Monaghan, Smoothed particle hydrodynamics, Reports on Progress in Physics 68 (2005) 1703
work page 2005
- [41]
-
[42]
K. M. T. Kleefsman, G. Fekken, A. E. P. Veldman, B. Iwanowski, B. Buchner, A volume-of-fluid based simulation method for wave impact problems, J. Comput. Phys. 206 (2005) 363–393
work page 2005
-
[43]
S. V . Patankar, D. B. Spalding, A calculation procedure for heat, mass and momentum transfer in three-dimensional parabolic flows, in: Numerical Prediction of Flow, Heat Transfer, Turbulence and Combustion, Elsevier, 1983, pp. 54–73
work page 1983
-
[44]
F. H. Harlow, J. E. Welch, Numerical calculation of time-dependent viscous incompressible flow of fluid with free surface, The physics of fluids 8 (1965) 2182–2189
work page 1965
-
[45]
J. Kim, P. Moin, Application of a fractional-step method to incompressible Navier-Stokes equations, Journal of computational physics 59 (1985) 308–323
work page 1985
-
[46]
A. J. Chorin, Numerical solution of the Navier-Stokes equations, Mathematics of computation 22 (1968) 745–762
work page 1968
-
[47]
P. Spalart, S. Allmaras, A one-equation turbulence model for aerodynamic flows, in: 30th aerospace sciences meeting and exhibit, 1992, p. 439
work page 1992
-
[48]
M. Selig, UIUC Airfoil Data Site, Department of Aeronautical and Astronautical Engineering University of Illinois at Urbana-Champaign,
-
[49]
URL: https://m-selig.ae.illinois.edu/ads/coord_database.html
-
[50]
P. W. Battaglia, J. B. Hamrick, J. B. Tenenbaum, Simulation as an engine of physical scene understanding, Proceedings of the National Academy of Sciences 110 (2013) 18327–18332
work page 2013
-
[51]
M. G. Kendall, A new measure of rank correlation, Biometrika 30 (1938) 81–93
work page 1938
-
[52]
R. Issa, D. Violeau, SPHERIC validation test 2, http://spheric-sph.org/validation-tests , 2017
work page 2017
-
[53]
Z. Wang, A. C. Bovik, Mean squared error: Love it or leave it? a new look at signal fidelity measures, IEEE Signal Processing Magazine 26 (2009) 98–117
work page 2009
-
[54]
Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612
work page 2004
- [55]
-
[56]
S. Zhao, N. Lardjane, I. Fedioun, Comparison of improved finite-difference weno schemes for the implicit large eddy simulation of turbulent non-reacting and reacting high-speed shear flows, Computers & Fluids 95 (2014) 74–87
work page 2014
-
[57]
V . Daru, C. Tenaud, Evaluation of tvd high resolution schemes for unsteady viscous shocked flows, Computers & Fluids 30 (2000) 89–113
work page 2000
-
[58]
G. Zhou, K. Xu, F. Liu, Grid-converged solution and analysis of the unsteady viscous flow in a two-dimensional shock tube, Physics of Fluids 30 (2018) 016102
work page 2018
-
[59]
Z. Lin, T. S. Hahm, W. Lee, W. M. Tang, R. B. White, Turbulent transport reduction by zonal flows: Massively parallel simulations, Science 281 (1998) 1835–1837
work page 1998
- [60]
- [61]
-
[62]
P. Woodward, P. Colella, The numerical simulation of two-dimensional fluid flow with strong shocks, J. Comput. Phys. 54 (1984) 115–173. Appendix 14 Um et al. / Preprint submitted to Journal of Computational Physics (2019) H1 H2 H3 H4 H1 H2 H3 H4 1.00 0.496 0.496 0.496 1.150 0.295 0.295 0.403 0.744 1.248 1.228 0.55 0.161 0.161 T op Front Specifications [m] Re...
work page 1984
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.