The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench

Adeel Pervez; Andrea Polesello; Caroline Muller; Dingling Yao; Francesco Locatello

arxiv: 2605.24782 · v2 · pith:4HZGD7BInew · submitted 2026-05-23 · 💻 cs.LG

Pith reviewed 2026-06-30 13:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords vision foundation models · scientific alignment · structural isomorphism · tropical cyclones · perception-physics paradox · TC-Bench · satellite imagery · representation learning

The pith

Vision foundation models rely on visual shortcuts that collapse in intense physical regimes rather than capturing structural invariants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision foundation models achieve strong predictive performance on satellite imagery through visual correlations that do not correspond to the underlying physical mechanisms. This creates the Perception-Physics Paradox, where models produce correct-looking outputs without correct physical reasoning. The authors introduce scientific alignment as a target for representation learning and operationalize one necessary condition for it through structural isomorphism, in which latent representations must uniquely identify physical systems up to linear reparameterization. They release TC-Bench, an automated benchmark built around tropical cyclone data, and demonstrate that current models depend on shortcuts that break down under high-intensity conditions, so that scaling alone does not produce scientific alignment.

Core claim

Current vision foundation models rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone. Structural isomorphism serves as a testable necessary condition for scientific alignment because it requires latent representations to uniquely identify physical systems up to linear reparameterization, and the TC-Bench protocol reveals that existing models fail this condition precisely when physical intensity increases.

What carries the argument

Structural isomorphism, the requirement that latent representations uniquely identify physical systems up to linear reparameterization, acting as a proxy for scientific alignment.
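One way to make the linear-reparameterization requirement concrete is a held-out linear probe: if the latents identify the physical state up to an affine map, an affine readout should recover that state on unseen samples. The sketch below is illustrative only and is not taken from the paper or from TC-Bench; the function name `linear_probe_r2`, the synthetic latents, and the 8-dimensional latent size are all assumptions.

```python
import numpy as np

def linear_probe_r2(latents, targets, train_frac=0.8, seed=0):
    """Fit an affine map latents -> targets on a train split; return held-out R^2."""
    rng = np.random.default_rng(seed)
    n = latents.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    train, test = perm[:n_train], perm[n_train:]
    # Append a constant column so the probe is affine (linear map plus offset).
    X = np.hstack([latents, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(X[train], targets[train], rcond=None)
    pred = X[test] @ coef
    ss_res = np.sum((targets[test] - pred) ** 2)
    ss_tot = np.sum((targets[test] - targets[test].mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic check (hypothetical data): latents are a noisy linear image of pressure.
rng = np.random.default_rng(1)
pressure = rng.uniform(900.0, 1010.0, size=500)       # central pressure in hPa
state = (pressure[:, None] - 950.0) / 50.0            # rescaled state, shape (500, 1)
mixing = rng.normal(size=(1, 8))                      # arbitrary linear reparameterization
latents = state @ mixing + 0.01 * rng.normal(size=(500, 8))

r2 = linear_probe_r2(latents, pressure)
print(f"held-out R^2: {r2:.4f}")
```

A high held-out R² is consistent with linear recoverability of the probed variable; it does not by itself show that the representation encodes the underlying physics rather than a correlated visual cue.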

If this is right

Standard out-of-distribution accuracy on perception tasks is not a reliable indicator of scientific utility.
Intense physical regimes act as a decisive stress test that exposes reliance on visual correlations.
Representation learning in scientific domains requires objectives that enforce conditions beyond predictive accuracy.
Automated construction of domain-specific benchmarks can systematically expose gaps between perception and physical reasoning.
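The stress-test idea can be sketched as an error report stratified by regime. This is a hedged illustration, not the TC-Bench protocol: the 980 hPa cut-off follows the intense-regime threshold quoted in the Figure 1 caption, while the function name `stratified_error`, the normalization by the standard deviation of the targets, and the toy saturating predictor are assumptions.

```python
import numpy as np

def stratified_error(pred_hpa, true_hpa, threshold_hpa=980.0):
    """Median normalized absolute error for moderate and intense samples."""
    pred_hpa = np.asarray(pred_hpa, dtype=float)
    true_hpa = np.asarray(true_hpa, dtype=float)
    # Assumption: normalize by the spread of the targets; the paper's own
    # normalization is not given in the text above.
    err = np.abs(pred_hpa - true_hpa) / true_hpa.std()
    intense = true_hpa < threshold_hpa
    return {
        "moderate": float(np.median(err[~intense])),
        "intense": float(np.median(err[intense])),
    }

# Toy predictor that saturates: it never predicts below 950 hPa.
true_hpa = np.array([1005.0, 995.0, 985.0, 970.0, 940.0, 920.0, 900.0])
pred_hpa = np.maximum(true_hpa, 950.0)

report = stratified_error(pred_hpa, true_hpa)
print(report)
```

An aggregate error over all samples would average the saturated storms together with the well-predicted ones; splitting by regime keeps the failure visible.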

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same shortcut problem is likely to appear in other scientific imaging settings such as medical scans or fluid-flow visualizations.
Training procedures that explicitly penalize non-unique representations could be tested as a route to isomorphism.
If structural isomorphism correlates with downstream causal tasks, it could serve as an intermediate check before full physics-informed evaluation.

Load-bearing premise

Structural isomorphism is a valid and sufficient proxy for the broader notion of scientific alignment in physical domains.

What would settle it

A vision foundation model that continues to make accurate physical predictions in the most intense tropical cyclone regimes while failing to satisfy structural isomorphism, or that satisfies isomorphism yet produces incorrect physical inferences.

Figures

Figures reproduced from arXiv: 2605.24782 by Adeel Pervez, Andrea Polesello, Caroline Muller, Dingling Yao, Francesco Locatello.

**Figure 1.** The Perception-Physics Paradox: (a) In intense regimes (Pc < 980 hPa), visually near-identical cyclones (e.g., 905 vs. 915 hPa) exhibit indistinguishable eye and cloud structures, causing vision models to map them to similar latent states and fail to resolve physically meaningful pressure differences. (b) Visually salient temporal changes (e.g., cloud expansion) can induce large predicted intensity shifts …

**Figure 2.** Static Fidelity (Qstat) and Regime Degradation: Normalized absolute error (ξstat) of state estimation across diverse VFM architectures, stratified by intensity. A consistent performance gap is observed across all model families: the Intense regime exhibits significantly higher median errors and variance compared to the Moderate regime. This confirms the Visual Saturation hypothesis, where representations s…

**Figure 3.** Dynamic Coherence Decay (Qdyn) across model families. The gap ξdyn increases with intensity (Pc ↓), with error spikes in the catastrophic regimes (Pc < 920 hPa), indicating degraded temporal alignment in extremes and compounding dynamical miscalibration.

[Plot panel (a) Latent-Physical Relationship: axes Minimum Central Pressure Pc (hPa) and Latent PC1; annotation: Collapse Zone; legend truncated …]

**Figure 5.** Dataset inspection: Per-agency distributions of minimum central pressure Pc (left) and maximum sustained wind speed Vm (right) in the TC-Bench dataset. While pressure distributions are broadly consistent across agencies, wind speed exhibits noticeable variability due to differences in averaging conventions (e.g., 1-minute vs. 10-minute winds), underscoring the greater robustness of pressure as a cross-agen…

**Figure 6.** Supervised pixel baseline. Predicted versus target pressure for a ResNet-18 trained from scratch and evaluated on held-out intense storms (Pc < 980 hPa). In the intense regime, a supervised pixel-space model achieves a normalized mean and median absolute error of 0.4 and 0.3 respectively, indicating that pressure remains statistically predictable from imagery despite increasing uncertainty.

**Figure 7.** Static Fidelity error using the aggregated spatial mean of all spatial tokens. As in …

**Figure 8.** Dynamic Coherence gap using the aggregated spatial mean of spatial tokens. The coherence error increases with storm intensity, mirroring the behavior observed with CLS-based representations.

**Figure 9.** PCA exemplar visualization of the first principal component (PC1) of DINOv3 latent representation.

**Figure 10.** Manifold inconsistency. Wind–pressure relationships stratified by latitude. (A) Ground truth: Physical constraints imply that low-latitude storms (< 15°) require higher winds to sustain the same pressure deficit than high-latitude storms (> 25°), producing a clear separation (ΔV ≈ 15 kt) in the intense regime. (B) Model prediction: The model partially recovers the directional trend but underestimates th…

Original abstract

While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based out-of-distribution accuracy a poor proxy for scientific utility. As a result, models may look correct without reasoning correctly, a discrepancy we term the Perception-Physics Paradox. To address this gap, we introduce scientific alignment as an implicit objective for representation learning in scientific domains. We study a principled, testable aspect of scientific alignment through structural isomorphism, which requires latent representations to uniquely identify physical systems up to a linear reparameterization. This perspective induces a hierarchy of necessary conditions and yields a systematic probing protocol for physical and causal interpretability. To operationalize this framework, we release TC-Bench, a global, reproducible benchmark dataset with an automated construction pipeline for tropical cyclone research, and show that current VFMs rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real output is TC-Bench plus the structural-isomorphism test; the broader claim that scaling alone cannot produce scientific alignment rests on whether that test actually tracks physical understanding.

The new pieces are the named paradox, the structural-isomorphism criterion (latents must identify the physical system up to linear reparameterization), and TC-Bench with its automated global cyclone pipeline. The benchmark construction looks reproducible on paper and targets a concrete domain where visual shortcuts are plausible, so that part is worth having even if the framing is new.

The hierarchy of necessary conditions that follows from the isomorphism definition is a tidy way to organize what would count as alignment. If the full experiments show that current VFMs fail the test specifically in high-intensity regimes while passing easier visual tasks, that would be a usable negative result for anyone training on satellite data.

The soft spot is the gap between the defined test and actual scientific utility. Structural isomorphism could reject models that carry the right invariants in a non-linear form, or accept models that match via spurious correlations. The abstract gives no numbers, error bars, or dataset statistics, so it is impossible to judge how large or consistent the reported failures are. The stress-test concern lands: if the proxy does not reliably separate physical from non-physical representations, the conclusion that alignment is not a byproduct of scaling rests on shaky ground.

This is for groups already working on representation learning for physical sciences or remote-sensing applications. The benchmark itself could be adopted or extended regardless of the paradox language. It deserves a serious referee because the dataset and pipeline are concrete and falsifiable; the conceptual claims can be pressure-tested in review without needing to accept the framing.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Perception-Physics Paradox, positing that Vision Foundation Models (VFMs) applied to scientific imagery tasks (e.g., tropical cyclone analysis) achieve predictive accuracy via visual correlations rather than structural physical invariants. It defines scientific alignment through the notion of structural isomorphism—latent representations that uniquely identify physical systems up to linear reparameterization—and derives a hierarchy of necessary conditions for physical and causal interpretability. The work releases TC-Bench, a reproducible global benchmark dataset with an automated pipeline for tropical cyclone research, and uses it to argue that current VFMs rely on shortcuts that collapse under intense regimes, implying scientific alignment is not a natural outcome of scaling.

Significance. If the TC-Bench results are robust and the structural-isomorphism proxy is shown to track scientific utility, the framework and benchmark could supply a concrete, falsifiable protocol for evaluating representation learning in physical-science domains, shifting evaluation away from standard OOD accuracy toward invariant-based criteria.

major comments (2)

[Abstract] Abstract: the central empirical claim that 'current VFMs rely on visual shortcuts that collapse in intense regimes' is stated without any reported quantitative metrics, error bars, dataset statistics, or ablation details, leaving the load-bearing evidence for the scaling conclusion unsupported in the provided text.
[Framework] Framework section (structural isomorphism definition): the requirement that latents 'uniquely identify physical systems up to a linear reparameterization' is presented as inducing necessary conditions for scientific alignment, yet the manuscript does not demonstrate that satisfaction of this linear-isomorphism test is either necessary or sufficient for possession of physical invariants; a model could pass via non-causal correlations or fail despite encoding invariants that are not linearly recoverable, directly undermining the claim that TC-Bench failures imply lack of alignment.
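The second major comment can be illustrated with a toy encoding that identifies the state uniquely yet is not linearly recoverable. In the sketch below, a normalized state `s` in [0, 1] is wrapped onto 90% of a circle, so the map is injective but curved; a linear readout recovers `s` only partially, while a nonlinear readout (`arctan2`) recovers it essentially exactly. Everything here (the circular encoding, the variable names, the 0.9 arc fraction) is a hypothetical construction, not a model or result from the paper.

```python
import numpy as np

ARC = 2.0 * np.pi * 0.9            # assumption: 90% of the circle keeps the map injective
s = np.linspace(0.0, 1.0, 1000)    # normalized physical state (hypothetical)
theta = ARC * s
latents = np.column_stack([np.cos(theta), np.sin(theta)])

# Affine probe: ordinary least squares on [latents, 1].
X = np.hstack([latents, np.ones((s.size, 1))])
coef, *_ = np.linalg.lstsq(X, s, rcond=None)
resid = s - X @ coef
linear_r2 = 1.0 - np.sum(resid ** 2) / np.sum((s - s.mean()) ** 2)

# Nonlinear readout: recover the angle, then rescale it back to [0, 1].
angle = np.mod(np.arctan2(latents[:, 1], latents[:, 0]), 2.0 * np.pi)
nonlinear_max_err = np.max(np.abs(angle / ARC - s))

print(f"linear R^2 = {linear_r2:.3f}, nonlinear max error = {nonlinear_max_err:.2e}")
```

In this toy case a linear probe would report a failure even though the state is fully encoded, which is why a failed linear test alone cannot distinguish a missing invariant from a nonlinearly encoded one.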

minor comments (1)

[Abstract] The term 'Perception-Physics Paradox' is introduced without a formal definition or citation to prior related paradoxes in the literature on shortcut learning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major comment below and indicate planned revisions.

Referee: [Abstract] Abstract: the central empirical claim that 'current VFMs rely on visual shortcuts that collapse in intense regimes' is stated without any reported quantitative metrics, error bars, dataset statistics, or ablation details, leaving the load-bearing evidence for the scaling conclusion unsupported in the provided text.

Authors: The abstract is a concise summary; the full manuscript provides the requested quantitative metrics, error bars, dataset statistics, and ablation details in the experimental evaluation sections. We will revise the abstract to include key quantitative highlights supporting the central claim.
Revision: yes

Referee: [Framework] Framework section (structural isomorphism definition): the requirement that latents 'uniquely identify physical systems up to a linear reparameterization' is presented as inducing necessary conditions for scientific alignment, yet the manuscript does not demonstrate that satisfaction of this linear-isomorphism test is either necessary or sufficient for possession of physical invariants; a model could pass via non-causal correlations or fail despite encoding invariants that are not linearly recoverable, directly undermining the claim that TC-Bench failures imply lack of alignment.

Authors: Structural isomorphism is introduced as a testable proxy inducing a hierarchy of necessary conditions for linear recoverability of physical parameters, not as a complete or exhaustive criterion. The TC-Bench experiments illustrate its practical value in surfacing shortcut failures. We will expand the framework section to explicitly discuss scope, limitations, and the distinction between the proxy and full physical invariance.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; new framework and benchmark are self-contained

The paper explicitly defines structural isomorphism as a testable proxy for scientific alignment and introduces TC-Bench as a new dataset with an automated pipeline. The central empirical claim (VFMs fail in intense regimes) is presented as a result of applying this new benchmark rather than a quantity fitted to or defined by the same inputs. No equations reduce a prediction to a fitted parameter by construction, no self-citation chain is invoked as load-bearing justification, and the derivation does not rename a known result or smuggle an ansatz. The provided text shows an independent proposal of a hierarchy of conditions and a probing protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on abstract; full paper may contain additional parameters or assumptions. The central claim rests on the definition of structural isomorphism as a proxy and the interpretation of performance collapse as evidence against scaling-induced alignment.

axioms (1)

Domain assumption: Latent representations must uniquely identify physical systems up to a linear reparameterization to satisfy structural isomorphism.
Presented in the abstract as the testable aspect of scientific alignment.

invented entities (1)

Perception-Physics Paradox (no independent evidence)
Purpose: Describes the discrepancy between perception accuracy and physical reasoning in models.
New term coined in the paper.

pith-pipeline@v0.9.1-grok · 5711 in / 1150 out tokens · 31064 ms · 2026-06-30T13:12:21.620112+00:00 · methodology
