Applying Vision Transformers on Spectral Analysis of Astronomical Objects
Pith reviewed 2026-05-19 12:12 UTC · model grok-4.3
The pith
Vision transformers pretrained on images classify astronomical objects and estimate redshifts when spectra are plotted as 2D images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting one-dimensional spectra into two-dimensional image representations, pretrained Vision Transformers can capture both local and global spectral features through spatial self-attention. When fine-tuned on millions of real spectra from the SDSS and LAMOST surveys, the model achieves higher classification accuracy than support vector machines and random forests while attaining R² values comparable to AstroCLIP's spectrum encoder, and it generalizes across diverse object types. This constitutes the first application of Vision Transformers to large-scale spectroscopic analysis that uses real data rather than synthetic inputs.
What carries the argument
Conversion of 1D spectra to 2D image plots fed to a Vision Transformer pretrained on ImageNet and fine-tuned on SDSS and LAMOST data, using spatial self-attention to process the plotted spectral features.
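The conversion step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `spectrum_to_image`, the flux floor, and the simple non-anti-aliased rasterizer are assumptions made for illustration. The 1000-point grid, logarithmic flux axis, and 224×224 output follow the settings reported in the simulated rebuttal further down. The input wavelengths are assumed to be sorted in increasing order.

```python
import numpy as np

def spectrum_to_image(wavelength, flux, size=224, n_grid=1000):
    """Rasterize a 1D spectrum into a size x size x 3 line-plot image."""
    wavelength = np.asarray(wavelength, dtype=float)
    flux = np.asarray(flux, dtype=float)

    # Resample onto a fixed, evenly spaced wavelength grid.
    grid = np.linspace(wavelength.min(), wavelength.max(), n_grid)
    resampled = np.interp(grid, wavelength, flux)

    # Logarithmic flux axis. The floor is an assumption: it keeps zero or
    # negative flux values from breaking the logarithm.
    floor = 1e-3 * max(float(np.max(resampled)), 1e-12)
    log_flux = np.log10(np.clip(resampled, floor, None))

    # Map grid index to pixel column and log-flux to pixel row (row 0 = top).
    cols = np.round(np.linspace(0, size - 1, n_grid)).astype(int)
    span = log_flux.max() - log_flux.min()
    scaled = (log_flux - log_flux.min()) / span if span > 0 else np.full(n_grid, 0.5)
    rows = np.round((1.0 - scaled) * (size - 1)).astype(int)

    image = np.zeros((size, size), dtype=np.uint8)
    # Join consecutive samples with vertical runs so the trace stays connected.
    for i in range(n_grid - 1):
        lo, hi = sorted((rows[i], rows[i + 1]))
        image[lo:hi + 1, cols[i]] = 255
    image[rows[-1], cols[-1]] = 255

    # ImageNet-pretrained ViT checkpoints expect three channels.
    return np.repeat(image[:, :, None], 3, axis=2)
```

Because roughly 1000 samples are squeezed into 224 pixel columns, several samples land in the same column. That is the kind of quantization the referee report raises for redshift estimation.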
If this is right
- Classification accuracy on stellar objects exceeds that of support vector machines and random forests.
- Redshift estimation achieves R² values comparable to those of AstroCLIP's dedicated spectrum encoder.
- The model maintains performance when generalizing across diverse object types in the test set.
- The method scales to large spectroscopic datasets from major surveys without requiring synthetic training data.
- Spatial self-attention on the image representation extracts both local wavelength features and global spectral shape.
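For reference, the R² figure used for redshift estimation is the standard coefficient of determination. The sketch below is a plain-Python illustration; the function name `r_squared` is hypothetical and nothing here is taken from the paper's code.

```python
def r_squared(z_true, z_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(z_true) != len(z_pred) or not z_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(z_true) / len(z_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(z_true, z_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in z_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot
```

A value of 1 means perfect prediction, 0 means no better than predicting the mean redshift, and negative values are possible for a model that does worse than the mean.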
Where Pith is reading between the lines
- Similar image-conversion steps could let vision models handle other one-dimensional scientific signals such as time-series photometry or radial-velocity curves.
- The success suggests that general pretraining on natural images transfers useful priors for pattern recognition in plotted scientific data.
- Future work could test whether different plotting conventions (log versus linear scales, color maps) change which spectral features the attention layers emphasize.
- The approach may reduce the need for astronomy-specific encoders when large labeled spectral catalogs already exist.
Load-bearing premise
Plotting one-dimensional spectra as two-dimensional images preserves all scientifically relevant information without adding artifacts that harm the transformer's downstream performance.
What would settle it
Training an equivalent model directly on raw one-dimensional spectral vectors and finding it matches or exceeds the plotted-image ViT on the same classification and redshift tasks would falsify the necessity of the image conversion step.
Original abstract
We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key tasks including stellar object classification and redshift ($z$) estimation, where it demonstrates strong performance and scalability. We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain $R^2$ values comparable to AstroCLIP's spectrum encoder, even when generalizing across diverse object types. These results demonstrate the effectiveness of using pretrained vision models for spectroscopic data analysis. To our knowledge, this is the first application of ViTs to large-scale spectroscopic analysis that also leverages real spectroscopic data and does not rely on synthetic inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes converting one-dimensional astronomical spectra into two-dimensional image plots to enable the use of pre-trained Vision Transformers (ViTs) for stellar object classification and redshift estimation. The authors fine-tune an ImageNet-pretrained ViT on millions of real spectra from the SDSS and LAMOST surveys and report higher classification accuracy than SVM and Random Forest baselines along with R² values comparable to AstroCLIP's spectrum encoder, with generalization across diverse object types. The work positions itself as the first application of ViTs to large-scale spectroscopic analysis using real rather than synthetic data.
Significance. If the performance claims hold under rigorous validation, the results indicate that off-the-shelf vision models can be adapted for spectroscopic tasks without custom architectures, offering a scalable route for processing large survey datasets. The emphasis on real observational data rather than simulations is a clear strength. However, the overall significance is limited by the absence of methodological specifics needed to assess robustness, particularly for the redshift regression task where precise feature localization matters.
major comments (2)
- [Abstract] The reported classification accuracy gains over SVM/RF and the R² comparability to AstroCLIP are presented without any mention of train-test split ratios, cross-validation strategy, number of runs, or error bars. These omissions are load-bearing for the central generalization claim across object types and prevent independent verification of whether post-hoc choices inflated the results.
- [Methods, spectral-to-image conversion] The core assumption that rendering 1D spectra as 2D plots (with unspecified binning, axis scaling, resolution, and anti-aliasing) transmits all scientifically relevant information is not justified or ablated. For redshift estimation this is especially problematic, as line centroids must be localized to sub-pixel accuracy; standard plotting pipelines can introduce quantization or blurring that a ViT pretrained on natural images may not recover, undermining direct comparability to AstroCLIP.
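The scale of the quantization concern can be estimated with a short calculation. The sketch below is illustrative only: the 3800 to 9200 Å range is an assumed optical coverage, not a figure stated in the paper, and both helper names are hypothetical. It uses the relation λ_obs = λ_rest (1 + z), so a shift of Δλ in the observed wavelength of a line corresponds to Δz = Δλ / λ_rest.

```python
def angstrom_per_pixel(wl_min, wl_max, n_pixels):
    """Wavelength interval covered by one pixel column of the plot."""
    return (wl_max - wl_min) / n_pixels

def redshift_step(delta_lambda, rest_wavelength):
    """Redshift change for an observed-wavelength shift of delta_lambda.

    From lambda_obs = lambda_rest * (1 + z): dz = d(lambda_obs) / lambda_rest.
    """
    return delta_lambda / rest_wavelength

# Assumed coverage (hypothetical): 3800-9200 Angstrom across 224 columns.
width = angstrom_per_pixel(3800.0, 9200.0, 224)
# H-alpha rest wavelength, about 6562.8 Angstrom.
dz = redshift_step(width, 6562.8)
print(f"{width:.1f} Angstrom per column, dz per column ~ {dz:.4f}")
```

Under these assumptions one pixel column spans about 24 Å, and a one-column shift of H-alpha corresponds to a redshift change of roughly 0.004. Anti-aliased rendering can encode some sub-pixel position in pixel intensities, so this sets the scale of the concern rather than a hard limit on precision.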
minor comments (2)
- [Abstract] The abstract's claim of being 'the first application of ViTs to large-scale' spectroscopic analysis should be supported by a concise literature review in the introduction to contextualize novelty.
- [Figures] Figure captions and the methods description would benefit from explicit statements of image resolution, wavelength range mapping, and flux normalization choices to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We provide point-by-point responses below and have revised the paper to address the concerns raised.
- Referee: [Abstract] The reported classification accuracy gains over SVM/RF and the R² comparability to AstroCLIP are presented without any mention of train-test split ratios, cross-validation strategy, number of runs, or error bars. These omissions are load-bearing for the central generalization claim across object types and prevent independent verification of whether post-hoc choices inflated the results.
  Authors: We agree that these details are necessary to support the generalization claims. In the revised manuscript we have added a new subsection in Methods that specifies the data split (80/20 train/test with stratification by object class), the use of 5-fold cross-validation, performance averaged over 10 independent runs, and the inclusion of standard deviations as error bars in all tables and figures. The abstract has been updated with a brief statement that results are reported as means with uncertainties from multiple runs.
  Revision: yes
- Referee: [Methods, spectral-to-image conversion] The core assumption that rendering 1D spectra as 2D plots (with unspecified binning, axis scaling, resolution, and anti-aliasing) transmits all scientifically relevant information is not justified or ablated. For redshift estimation this is especially problematic, as line centroids must be localized to sub-pixel accuracy; standard plotting pipelines can introduce quantization or blurring that a ViT pretrained on natural images may not recover, undermining direct comparability to AstroCLIP.
  Authors: We acknowledge the need for greater methodological transparency. The revised Methods section now details the conversion pipeline: spectra are interpolated to a fixed 1000-point wavelength grid, plotted with linear wavelength scaling on the x-axis and logarithmic flux on the y-axis, rendered at 224×224 pixel resolution using anti-aliased lines, and saved as PNG images. We have also added an ablation study that varies binning density, resolution, and scaling choices and reports the resulting changes in classification accuracy and redshift R². We recognize that 2D rendering may introduce minor quantization relative to native 1D encoders, and we now discuss this explicitly as a limitation. The ablation nonetheless shows that performance remains stable across reasonable parameter ranges and continues to match AstroCLIP levels.
  Revision: yes
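The evaluation protocol named in the first response (an 80/20 split stratified by object class, with results reported as mean and standard deviation over repeated runs) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the helper names `stratified_split` and `summarize` and the class labels in the usage example are hypothetical, and the 5-fold cross-validation step is not shown.

```python
import random
import statistics
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at test_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def summarize(scores):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Usage with hypothetical labels and made-up per-run scores.
labels = ["STAR"] * 80 + ["GALAXY"] * 15 + ["QSO"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
mean_acc, std_acc = summarize([0.90, 0.92, 0.94])
```

Stratifying by class keeps rare object types represented in the test set in the same proportion as in the full sample, which matters for a claim of generalization across object types.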
Circularity Check
Empirical ML application with external baselines exhibits no circularity
The paper applies a pre-trained ImageNet ViT to 1D spectra rendered as 2D plots, then reports measured classification accuracy and redshift R² on SDSS/LAMOST data. These metrics are compared against independent external baselines (SVM, RF, AstroCLIP spectrum encoder) rather than being derived from or equivalent to any fitted parameters or self-referential definitions within the work. No equations, uniqueness theorems, or ansatzes reduce the central claims to the inputs by construction; the derivation chain consists of standard fine-tuning and evaluation steps that remain falsifiable against the cited external methods.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ImageNet-pretrained weights provide a useful inductive bias for spectral plot images
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain R² values comparable to AstroCLIP's spectrum encoder"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.