The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench
Pith reviewed 2026-06-30 13:12 UTC · model grok-4.3
The pith
Vision foundation models rely on visual shortcuts that collapse in intense physical regimes rather than capturing structural invariants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current vision foundation models rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone. Structural isomorphism serves as a testable necessary condition for scientific alignment because it requires latent representations to uniquely identify physical systems up to linear reparameterization, and the TC-Bench protocol reveals that existing models fail this condition precisely when physical intensity increases.
What carries the argument
Structural isomorphism, the requirement that latent representations uniquely identify physical systems up to linear reparameterization, acting as a proxy for scientific alignment.
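One way to make "uniquely identify physical systems up to linear reparameterization" concrete is sketched below. The notation is an assumption introduced here for illustration and is not taken from the paper: f is the encoder, θ(x) the physical state underlying input x, and W, b an affine readout.

```latex
% Hypothetical formalization; all symbols are illustrative assumptions.
% f : \mathcal{X} \to \mathbb{R}^{d} is the encoder,
% \theta(x) \in \mathbb{R}^{k} is the physical state behind input x.
\exists\, W \in \mathbb{R}^{k \times d},\ b \in \mathbb{R}^{k} :
\qquad W f(x) + b = \theta(x) \quad \text{for all } x \in \mathcal{X},
\qquad \text{hence} \qquad
f(x) = f(x') \;\Longrightarrow\; \theta(x) = \theta(x').
```

Under this reading, two inputs with the same latent must share the same physical state, and the state can be read off with a linear probe. The paper's own definition may differ in detail.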
If this is right
- Standard out-of-distribution accuracy on perception tasks is not a reliable indicator of scientific utility.
- Intense physical regimes act as a decisive stress test that exposes reliance on visual correlations.
- Representation learning in scientific domains requires objectives that enforce conditions beyond predictive accuracy.
- Automated construction of domain-specific benchmarks can systematically expose gaps between perception and physical reasoning.
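The stress-test idea in the list above can be illustrated with a linear probe scored separately per intensity bin. This is a minimal sketch on synthetic data, not the TC-Bench protocol: the function names, the wind-speed range, the bin edges, and the saturating "shortcut" feature are all assumptions made for the example.

```python
import numpy as np


def fit_linear_probe(z, y):
    """Least-squares fit of y ~ z @ w + b; returns (w, b)."""
    design = np.hstack([z, np.ones((z.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:-1], coef[-1]


def r2(y_true, y_pred):
    """Coefficient of determination, computed within the given sample."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def stratified_probe_r2(z_train, y_train, z_test, y_test, bin_edges):
    """Fit one probe on all training data, then score it per target bin."""
    w, b = fit_linear_probe(z_train, y_train)
    pred = z_test @ w + b
    scores = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (y_test >= lo) & (y_test < hi)
        if mask.sum() >= 2:
            scores[(lo, hi)] = r2(y_test[mask], pred[mask])
    return scores


rng = np.random.default_rng(0)


def make_split(n):
    """Synthetic stand-in data (hypothetical, not from TC-Bench)."""
    wind = rng.uniform(20.0, 160.0, size=n)  # target, e.g. wind speed
    # "Aligned" latent: varies linearly with the target everywhere.
    aligned = 0.5 * wind + rng.normal(0.0, 1.0, size=n)
    # "Shortcut" latent: tracks the target, then saturates above 100.
    shortcut = np.minimum(wind, 100.0) + rng.normal(0.0, 2.0, size=n)
    return wind, aligned[:, None], shortcut[:, None]


y_tr, a_tr, s_tr = make_split(2000)
y_te, a_te, s_te = make_split(2000)
bins = [20.0, 100.0, 160.0]  # assumed split into weak and intense regimes

aligned_scores = stratified_probe_r2(a_tr, y_tr, a_te, y_te, bins)
shortcut_scores = stratified_probe_r2(s_tr, y_tr, s_te, y_te, bins)
print("aligned :", aligned_scores)
print("shortcut:", shortcut_scores)
```

In this toy setup the saturating feature still scores reasonably in the weak bin but loses all within-bin explanatory power in the intense bin, which is the kind of collapse a pooled accuracy number can hide.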
Where Pith is reading between the lines
- The same shortcut problem is likely to appear in other scientific imaging settings such as medical scans or fluid-flow visualizations.
- Training procedures that explicitly penalize non-unique representations could be tested as a route to isomorphism.
- If structural isomorphism correlates with downstream causal tasks, it could serve as an intermediate check before full physics-informed evaluation.
Load-bearing premise
Structural isomorphism is a valid and sufficient proxy for the broader notion of scientific alignment in physical domains.
What would settle it
A vision foundation model that continues to make accurate physical predictions in the most intense tropical cyclone regimes while failing to satisfy structural isomorphism, or that satisfies isomorphism yet produces incorrect physical inferences.
Original abstract
While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based out-of-distribution accuracy a poor proxy for scientific utility. As a result, models may look correct without reasoning correctly, a discrepancy we term the Perception-Physics Paradox. To address this gap, we introduce scientific alignment as an implicit objective for representation learning in scientific domains. We study a principled, testable aspect of scientific alignment through structural isomorphism, which requires latent representations to uniquely identify physical systems up to a linear reparameterization. This perspective induces a hierarchy of necessary conditions and yields a systematic probing protocol for physical and causal interpretability. To operationalize this framework, we release TC-Bench, a global, reproducible benchmark dataset with an automated construction pipeline for tropical cyclone research, and show that current VFMs rely on visual shortcuts that collapse in intense regimes, indicating that scientific alignment does not arise as a natural byproduct of scaling alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Perception-Physics Paradox, positing that Vision Foundation Models (VFMs) applied to scientific imagery tasks (e.g., tropical cyclone analysis) achieve predictive accuracy via visual correlations rather than structural physical invariants. It defines scientific alignment through the notion of structural isomorphism—latent representations that uniquely identify physical systems up to linear reparameterization—and derives a hierarchy of necessary conditions for physical and causal interpretability. The work releases TC-Bench, a reproducible global benchmark dataset with an automated pipeline for tropical cyclone research, and uses it to argue that current VFMs rely on shortcuts that collapse under intense regimes, implying scientific alignment is not a natural outcome of scaling.
Significance. If the TC-Bench results are robust and the structural-isomorphism proxy is shown to track scientific utility, the framework and benchmark could supply a concrete, falsifiable protocol for evaluating representation learning in physical-science domains, shifting evaluation away from standard OOD accuracy toward invariant-based criteria.
major comments (2)
- [Abstract] The central empirical claim that 'current VFMs rely on visual shortcuts that collapse in intense regimes' is stated without any reported quantitative metrics, error bars, dataset statistics, or ablation details, leaving the load-bearing evidence for the scaling conclusion unsupported in the provided text.
- [Framework] Framework section (structural isomorphism definition): the requirement that latents 'uniquely identify physical systems up to a linear reparameterization' is presented as inducing necessary conditions for scientific alignment, yet the manuscript does not demonstrate that satisfaction of this linear-isomorphism test is either necessary or sufficient for possession of physical invariants; a model could pass via non-causal correlations or fail despite encoding invariants that are not linearly recoverable, directly undermining the claim that TC-Bench failures imply lack of alignment.
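The two failure modes raised in the Framework comment can be reproduced on toy data. The sketch below is illustrative only and uses assumed names and synthetic latents, not anything from the manuscript: in case A the latent determines the target exactly but only through a product, so a linear probe scores near zero; in case B a merely correlated feature passes the same probe.

```python
import numpy as np


def linear_r2(features, target):
    """R^2 of an ordinary least-squares fit (with intercept)."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return float(1.0 - residual.var() / target.var())


# Case A: the latent fixes theta exactly, but not linearly.
grid = np.linspace(-1.0, 1.0, 41)
z1, z2 = np.meshgrid(grid, grid)
latent = np.column_stack([z1.ravel(), z2.ravel()])
theta = latent[:, 0] * latent[:, 1]  # toy "physical" quantity

r2_linear = linear_r2(latent, theta)

# The same latent with quadratic features recovers theta exactly.
quadratic = np.column_stack([
    latent,
    latent[:, 0] ** 2,
    latent[:, 0] * latent[:, 1],
    latent[:, 1] ** 2,
])
r2_quadratic = linear_r2(quadratic, theta)

# Case B: a feature that is only correlated with theta passes the probe.
rng = np.random.default_rng(0)
theta_b = np.linspace(-1.0, 1.0, 1000)
spurious = theta_b + 0.05 * rng.normal(size=theta_b.size)
r2_shortcut = linear_r2(spurious[:, None], theta_b)

print("case A linear probe R^2   :", r2_linear)
print("case A quadratic probe R^2:", r2_quadratic)
print("case B shortcut probe R^2 :", r2_shortcut)
```

Neither case says anything about real VFMs. They only show that a linear-probe score, taken alone, cannot separate "encodes the quantity nonlinearly" from "does not encode it", nor "encodes the quantity" from "encodes a correlate".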
minor comments (1)
- [Abstract] The term 'Perception-Physics Paradox' is introduced without a formal definition or citation to prior related paradoxes in the literature on shortcut learning.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address each major comment below and indicate planned revisions.
- Referee: [Abstract] The central empirical claim that 'current VFMs rely on visual shortcuts that collapse in intense regimes' is stated without any reported quantitative metrics, error bars, dataset statistics, or ablation details, leaving the load-bearing evidence for the scaling conclusion unsupported in the provided text.
  Authors: The abstract is a concise summary; the full manuscript provides the requested quantitative metrics, error bars, dataset statistics, and ablation details in the experimental evaluation sections. We will revise the abstract to include key quantitative highlights supporting the central claim.
  Revision: yes
- Referee: [Framework] Framework section (structural isomorphism definition): the requirement that latents 'uniquely identify physical systems up to a linear reparameterization' is presented as inducing necessary conditions for scientific alignment, yet the manuscript does not demonstrate that satisfaction of this linear-isomorphism test is either necessary or sufficient for possession of physical invariants; a model could pass via non-causal correlations or fail despite encoding invariants that are not linearly recoverable, directly undermining the claim that TC-Bench failures imply lack of alignment.
  Authors: Structural isomorphism is introduced as a testable proxy inducing a hierarchy of necessary conditions for linear recoverability of physical parameters, not as a complete or exhaustive criterion. The TC-Bench experiments illustrate its practical value in surfacing shortcut failures. We will expand the framework section to explicitly discuss scope, limitations, and the distinction between the proxy and full physical invariance.
  Revision: partial
Circularity Check
No significant circularity; new framework and benchmark are self-contained
The paper explicitly defines structural isomorphism as a testable proxy for scientific alignment and introduces TC-Bench as a new dataset with an automated pipeline. The central empirical claim (VFMs fail in intense regimes) is presented as a result of applying this new benchmark rather than a quantity fitted to or defined by the same inputs. No equations reduce a prediction to a fitted parameter by construction, no self-citation chain is invoked as load-bearing justification, and the derivation does not rename a known result or smuggle an ansatz. The provided text shows an independent proposal of a hierarchy of conditions and a probing protocol.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Latent representations must uniquely identify physical systems up to a linear reparameterization to satisfy structural isomorphism.
invented entities (1)
- Perception-Physics Paradox: no independent evidence