arxiv: 2506.00294 · v1 · submitted 2025-05-30 · 🌌 astro-ph.IM · cs.CV

Applying Vision Transformers on Spectral Analysis of Astronomical Objects

Pith reviewed 2026-05-19 12:12 UTC · model grok-4.3

classification 🌌 astro-ph.IM cs.CV
keywords Vision Transformers · Astronomical spectra · Spectral classification · Redshift estimation · SDSS · LAMOST · Machine learning · Image representation of spectra

The pith

Vision transformers pretrained on images classify astronomical objects and estimate redshifts when spectra are plotted as 2D images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision transformers originally built for photographs can analyze astronomical spectra. It converts one-dimensional light measurements into two-dimensional plots so the model's spatial attention can examine both nearby wavelength details and broader patterns across the full spectrum. The approach fine-tunes a model on millions of real spectra from large sky surveys and evaluates it on separating different types of stars and galaxies as well as estimating their distances. Performance exceeds standard machine-learning baselines on classification and reaches levels comparable to specialized spectrum models on regression, even across mixed object populations. This opens a route for reusing general-purpose image models on scientific line data without building new architectures from scratch.

Core claim

By converting one-dimensional spectra into two-dimensional image representations, pretrained Vision Transformers can capture both local and global spectral features through spatial self-attention. When fine-tuned on millions of real spectra from the SDSS and LAMOST surveys, the model achieves higher classification accuracy than support vector machines and random forests while attaining R² values comparable to AstroCLIP's spectrum encoder, and it generalizes across diverse object types. This constitutes the first application of Vision Transformers to large-scale spectroscopic analysis that uses real data rather than synthetic inputs.
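The two headline metrics in this claim can be written out in a few lines. The sketch below is illustrative only: the function names, the class labels, and the redshift values are made-up placeholders, not numbers or code from the paper.

```python
def accuracy(labels_true, labels_pred):
    """Fraction of objects whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Placeholder labels and redshifts, chosen only to exercise the functions.
classes_true = ["STAR", "GALAXY", "QSO", "STAR"]
classes_pred = ["STAR", "GALAXY", "STAR", "STAR"]
z_true = [0.1, 0.5, 1.0, 2.0]
z_pred = [0.12, 0.45, 1.1, 1.9]

print(accuracy(classes_true, classes_pred))
print(r2_score(z_true, z_pred))
```

An R² of 1 means the predictions match the true redshifts exactly, and an R² of 0 means they do no better than always predicting the mean redshift, which is why "comparable R²" is only meaningful when both models are scored on the same test set.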

What carries the argument

Conversion of 1D spectra to 2D image plots fed to a Vision Transformer pretrained on ImageNet and fine-tuned on SDSS and LAMOST data, using spatial self-attention to process the plotted spectral features.
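As a rough illustration of the conversion step, the following sketch rasterizes a flux array into a square grayscale line plot using NumPy alone. It is not the authors' plotting code: the function name `spectrum_to_image`, the 224-pixel size, the min-max normalization, and the synthetic emission line are all assumptions made for the example.

```python
import numpy as np


def spectrum_to_image(flux, size=224):
    """Rasterize a 1D flux array as a white-on-black line plot (size x size)."""
    flux = np.asarray(flux, dtype=float)
    # Resample so that each pixel column carries exactly one flux value.
    src = np.linspace(0, len(flux) - 1, size)
    resampled = np.interp(src, np.arange(len(flux)), flux)
    # Min-max normalize to [0, 1]; a perfectly flat spectrum goes to mid-height.
    span = resampled.max() - resampled.min()
    if span > 0:
        norm = (resampled - resampled.min()) / span
    else:
        norm = np.full(size, 0.5)
    rows = np.round((1.0 - norm) * (size - 1)).astype(int)  # row 0 is the top
    image = np.zeros((size, size), dtype=np.uint8)
    for col in range(size):
        # Fill the vertical gap to the previous column so the curve is connected.
        prev = rows[col - 1] if col > 0 else rows[col]
        lo, hi = sorted((prev, rows[col]))
        image[lo:hi + 1, col] = 255
    return image


# Synthetic example: flat continuum with one Gaussian emission line.
wavelength = np.linspace(3800.0, 9200.0, 3000)
flux = 1.0 + 5.0 * np.exp(-0.5 * ((wavelength - 6563.0) / 10.0) ** 2)
image = spectrum_to_image(flux)
print(image.shape, image.dtype)
```

Note that 3000 flux samples are squeezed into 224 columns here, so a narrow line can fall between columns and lose height. That is the kind of quantization the load-bearing premise has to survive.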

If this is right

  • Classification accuracy on stellar objects exceeds that of support vector machines and random forests.
  • Redshift estimation achieves R² values comparable to those of AstroCLIP's dedicated spectrum encoder.
  • The model maintains performance when generalizing across diverse object types in the test set.
  • The method scales to large spectroscopic datasets from major surveys without requiring synthetic training data.
  • Spatial self-attention on the image representation extracts both local wavelength features and global spectral shape.
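To make the last point concrete, here is a minimal NumPy sketch of how a plotted spectrum becomes patch tokens and how a single self-attention head lets every patch attend to every other patch. It assumes 16×16 patches on a 224×224 image, a toy embedding width of 32, and random weights; none of these values are confirmed by the source, and a real ViT adds position embeddings, many heads, and many layers. The final step mirrors the regression head described in the paper (one linear layer on the CLS token output).

```python
import numpy as np


def patchify(image, patch=16):
    """Split an (H, W) image into flattened, non-overlapping patch x patch tiles."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    tiles = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
image = rng.random((224, 224))            # stand-in for a plotted spectrum
patches = patchify(image)                 # 196 patches of 256 pixels each
dim = 32                                  # toy embedding width (assumption)
embed = rng.normal(scale=0.02, size=(patches.shape[1], dim))
cls_token = np.zeros((1, dim))            # learnable in a real model
tokens = np.vstack([cls_token, patches @ embed])
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)

# Regression head: one linear layer mapping the CLS output to a scalar.
head = rng.normal(scale=0.1, size=dim)
z_pred = float(out[0] @ head)
print(patches.shape, weights.shape, z_pred)
```

Each row of `weights` spans all 197 tokens, so a patch containing one emission line can draw on patches far along the wavelength axis. This is the mechanism behind the "local and global" wording.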

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar image-conversion steps could let vision models handle other one-dimensional scientific signals such as time-series photometry or radial-velocity curves.
  • The success suggests that general pretraining on natural images transfers useful priors for pattern recognition in plotted scientific data.
  • Future work could test whether different plotting conventions (log versus linear scales, color maps) change which spectral features the attention layers emphasize.
  • The approach may reduce the need for astronomy-specific encoders when large labeled spectral catalogs already exist.

Load-bearing premise

Plotting one-dimensional spectra as two-dimensional images preserves all scientifically relevant information without adding artifacts that harm the transformer's downstream performance.

What would settle it

Training an equivalent model directly on raw one-dimensional spectral vectors and finding it matches or exceeds the plotted-image ViT on the same classification and redshift tasks would falsify the necessity of the image conversion step.

Figures

Figures reproduced from arXiv: 2506.00294 by Guillermo Cabrera-Vives, Ignacio Becker, Luis Felipe Strano Moraes, Pavlos Protopapas.

Figure 1. Distribution of redshifts for the SDSS dataset.
Figure 2. Distribution of redshifts for the LAMOST dataset.
Page footnotes: (1) snMedian is a value provided by SDSS to represent an overall SNR value for the object across the different filter bands. (2) We computed snMedian as $\mathrm{snMedian} = \sqrt{\sum_{i \in \{u,g,r,i,z\}} \mathrm{SNR}_i^2}$, following SDSS-derived practices.
Figure 3. Comparison of snMedian distributions across both datasets via a log-scale boxplot. The central box spans the interquartile range (25th–75th percentiles), whiskers extend to the 5th and 95th percentiles, and outliers beyond this range are omitted for clarity.
Adjacent text from the paper (3.1. SDSS): SDSS, operational since 2000, has mapped millions of celestial objects, including stars, galaxies, and quasars, using fiber-optic spectrog…
Figure 4. Simple plot type of spectra for one of the SDSS objects, a starburst galaxy with ID 9068120565953615872.
Figure 5. Model pipeline example for a regression task: (a) data is obtained from surveys, (b) processed and kept in local files, (c) goes through generation of different plot types, and (d) passes through the ViT Base Model for fine-tuning.
Adjacent text from the paper: …els, with a custom regression head. This head consists of a single linear layer that maps the CLS token output from the ViT to a single continuous value representing the predicted reds…
Figure 6. Overlap plot type.
Adjacent text from the paper: …, we present a plot called Overlap, where we divide the spectrum into three segments of equal length and map each to a separate RGB color channel to determine if this approach yields any performance gains, due to allowing for larger detail of each section of the spectra to be shown in the final image. Since the usual plots tend to have mostly empty space, we further introduce a more information-dense appro…
Figure 7. 2D Map plot type of the spectra, where we associate each individual wavelength with a square of 3x3 pixels in the final plot.
Adjacent text from the paper: The final step of the pipeline involves using the generated 2D spectral images to fine-tune the base ViT model, which was pretrained on large-scale image datasets, to adapt its weights to spectral data. The fine-tuning step involves training the model on the spectral image dataset usi…
Figure 8. Overview of how each individual flux is mapped to the final 2D image in the 2D Map design. Labels a, b and c can be seen on the left in the standard flux plot, and on the right side with intensity set as the color of a given region in the image.
Adjacent text from the paper: … is validated using standard evaluation metrics such as accuracy, F1-score, and MSE. 6. Results: In this section, we present the outcomes of our experiments and analy…
Figure 9. Residuals of model predictions displayed as boxplots across true redshift bins. Each box spans the interquartile range (25th–75th percentiles) of the residual distribution within that bin; whiskers extend to the 5th and 95th percentiles. Prediction results from regression with 2D Map over SLOMOST-Big.
Adjacent text from the paper (6.3. Results on stellar parameters): Finally, we report $R^2$ for three stellar parameter regression tasks…
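The two denser layouts described in the captions above (Overlap and 2D Map) can be sketched as follows. This is a guess at the general idea rather than the authors' implementation. The function names, the row-major cell ordering, the zero padding of unused cells, the truncation of over-long spectra, the shared min-max normalization, and the one-pixel-per-column rendering are all assumptions.

```python
import numpy as np


def map_2d(flux, size=224, cell=3):
    """Encode each flux value as the intensity of one cell x cell pixel square."""
    flux = np.asarray(flux, dtype=float)
    per_side = size // cell                    # 74 cells per side for 224 / 3
    capacity = per_side * per_side
    span = flux.max() - flux.min()
    norm = (flux - flux.min()) / span if span > 0 else np.zeros_like(flux)
    grid = np.zeros(capacity)
    n = min(len(flux), capacity)               # longer spectra are truncated
    grid[:n] = norm[:n]                        # row-major order (assumption)
    tiles = np.kron(grid.reshape(per_side, per_side), np.ones((cell, cell)))
    image = np.zeros((size, size))
    image[:per_side * cell, :per_side * cell] = tiles
    return image


def overlap(flux, size=224):
    """Draw three equal-length spectrum segments into the R, G and B channels."""
    flux = np.asarray(flux, dtype=float)
    image = np.zeros((size, size, 3))
    lo, span = flux.min(), flux.max() - flux.min()
    cols = np.arange(size)
    for channel, segment in enumerate(np.array_split(flux, 3)):
        src = np.linspace(0, len(segment) - 1, size)
        resampled = np.interp(src, np.arange(len(segment)), segment)
        norm = (resampled - lo) / span if span > 0 else np.full(size, 0.5)
        rows = np.round((1.0 - norm) * (size - 1)).astype(int)
        image[rows, cols, channel] = 1.0       # one lit pixel per column
    return image


flux = np.linspace(0.0, 1.0, 4000) ** 2        # synthetic, smoothly rising flux
print(map_2d(flux).shape, overlap(flux).shape)
```

With 3×3 cells, a 224×224 image holds 74 × 74 = 5476 flux values, so a spectrum of a few thousand samples fits without resampling. A line plot, by contrast, leaves most pixels empty.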
Original abstract

We apply pre-trained Vision Transformers (ViTs), originally developed for image recognition, to the analysis of astronomical spectral data. By converting traditional one-dimensional spectra into two-dimensional image representations, we enable ViTs to capture both local and global spectral features through spatial self-attention. We fine-tune a ViT pretrained on ImageNet using millions of spectra from the SDSS and LAMOST surveys, represented as spectral plots. Our model is evaluated on key tasks including stellar object classification and redshift ($z$) estimation, where it demonstrates strong performance and scalability. We achieve classification accuracy higher than Support Vector Machines and Random Forests, and attain $R^2$ values comparable to AstroCLIP's spectrum encoder, even when generalizing across diverse object types. These results demonstrate the effectiveness of using pretrained vision models for spectroscopic data analysis. To our knowledge, this is the first application of ViTs to large-scale spectroscopic analysis, which also leverages real spectroscopic data and does not rely on synthetic inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes converting one-dimensional astronomical spectra into two-dimensional image plots to enable the use of pre-trained Vision Transformers (ViTs) for stellar object classification and redshift estimation. The authors fine-tune an ImageNet-pretrained ViT on millions of real spectra from the SDSS and LAMOST surveys and report higher classification accuracy than SVM and Random Forest baselines along with R² values comparable to AstroCLIP's spectrum encoder, with generalization across diverse object types. The work positions itself as the first application of ViTs to large-scale spectroscopic analysis using real rather than synthetic data.

Significance. If the performance claims hold under rigorous validation, the results indicate that off-the-shelf vision models can be adapted for spectroscopic tasks without custom architectures, offering a scalable route for processing large survey datasets. The emphasis on real observational data rather than simulations is a clear strength. However, the overall significance is limited by the absence of methodological specifics needed to assess robustness, particularly for the redshift regression task where precise feature localization matters.

major comments (2)
  1. [Abstract] Abstract: The reported classification accuracy gains over SVM/RF and the R² comparability to AstroCLIP are presented without any mention of train-test split ratios, cross-validation strategy, number of runs, or error bars. These omissions are load-bearing for the central generalization claim across object types and prevent independent verification of whether post-hoc choices inflated the results.
  2. [Methods] Methods (spectral-to-image conversion): The core assumption that rendering 1D spectra as 2D plots (with unspecified binning, axis scaling, resolution, and anti-aliasing) transmits all scientifically relevant information is not justified or ablated. For redshift estimation this is especially problematic, as line centroids must be localized to sub-pixel accuracy; standard plotting pipelines can introduce quantization or blurring that a ViT pretrained on natural images may not recover, undermining direct comparability to AstroCLIP.
minor comments (2)
  1. [Abstract] The abstract's claim of being 'the first application of ViTs to large-scale' spectroscopic analysis should be supported by a concise literature review in the introduction to contextualize novelty.
  2. [Figures] Figure captions and the methods description would benefit from explicit statements of image resolution, wavelength range mapping, and flux normalization choices to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We provide point-by-point responses below and have revised the paper to address the concerns raised.

  1. Referee: [Abstract] Abstract: The reported classification accuracy gains over SVM/RF and the R² comparability to AstroCLIP are presented without any mention of train-test split ratios, cross-validation strategy, number of runs, or error bars. These omissions are load-bearing for the central generalization claim across object types and prevent independent verification of whether post-hoc choices inflated the results.

    Authors: We agree that these details are necessary to support the generalization claims. In the revised manuscript we have added a new subsection in Methods that specifies the data split (80/20 train/test with stratification by object class), the use of 5-fold cross-validation, performance averaged over 10 independent runs, and the inclusion of standard deviations as error bars in all tables and figures. The abstract has been updated with a brief statement that results are reported as means with uncertainties from multiple runs. revision: yes

  2. Referee: [Methods] Methods (spectral-to-image conversion): The core assumption that rendering 1D spectra as 2D plots (with unspecified binning, axis scaling, resolution, and anti-aliasing) transmits all scientifically relevant information is not justified or ablated. For redshift estimation this is especially problematic, as line centroids must be localized to sub-pixel accuracy; standard plotting pipelines can introduce quantization or blurring that a ViT pretrained on natural images may not recover, undermining direct comparability to AstroCLIP.

    Authors: We acknowledge the need for greater methodological transparency. The revised Methods section now details the conversion pipeline: spectra are interpolated to a fixed 1000-point wavelength grid, plotted with linear wavelength scaling on the x-axis and logarithmic flux on the y-axis, rendered at 224×224 pixel resolution using anti-aliased lines, and saved as PNG images. We have also added an ablation study that varies binning density, resolution, and scaling choices and reports the resulting changes in classification accuracy and redshift R². While we recognize that 2D rendering may introduce minor quantization relative to native 1D encoders, the ablation shows that performance remains stable across reasonable parameter ranges and continues to match AstroCLIP levels, which we now discuss explicitly as a limitation. revision: yes
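The preprocessing described in this simulated response (a fixed 1000-point wavelength grid and logarithmic flux) might look roughly like the sketch below. The function name `resample_spectrum` and the `floor` used to guard against non-positive flux before taking the logarithm are assumptions, and the input wavelengths are assumed to be sorted in increasing order.

```python
import numpy as np


def resample_spectrum(wavelength, flux, n_points=1000, floor=1e-3):
    """Interpolate a spectrum onto a fixed linear wavelength grid and log-scale the flux."""
    wavelength = np.asarray(wavelength, dtype=float)
    flux = np.asarray(flux, dtype=float)
    grid = np.linspace(wavelength.min(), wavelength.max(), n_points)
    resampled = np.interp(grid, wavelength, flux)   # linear interpolation
    # Clip before the log so noisy negative or zero flux does not produce NaN or -inf.
    log_flux = np.log10(np.clip(resampled, floor, None))
    return grid, log_flux


# Synthetic input: 3500 samples with constant flux 10 (placeholder values).
wavelength = np.linspace(4000.0, 9000.0, 3500)
flux = np.full(3500, 10.0)
grid, log_flux = resample_spectrum(wavelength, flux)
print(grid.shape, log_flux.shape)
```

Rendering 1000 grid points into 224 pixel columns still collapses roughly four to five samples per column, so the ablation over binning density and resolution mentioned in the response is the place to check how much line-centroid precision is lost.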

Circularity Check

0 steps flagged

Empirical ML application with external baselines exhibits no circularity

The paper applies a pre-trained ImageNet ViT to 1D spectra rendered as 2D plots, then reports measured classification accuracy and redshift R² on SDSS/LAMOST data. These metrics are compared against independent external baselines (SVM, RF, AstroCLIP spectrum encoder) rather than being derived from or equivalent to any fitted parameters or self-referential definitions within the work. No equations, uniqueness theorems, or ansatzes reduce the central claims to the inputs by construction; the derivation chain consists of standard fine-tuning and evaluation steps that remain falsifiable against the cited external methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that spectral plots are a lossless or near-lossless representation for transformer attention and that ImageNet pre-training transfers useful features to astronomical spectra. No new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: ImageNet-pretrained weights provide useful inductive bias for spectral plot images
    Invoked when the authors fine-tune a ViT pretrained on ImageNet rather than training from random initialization.

pith-pipeline@v0.9.0 · 5701 in / 1247 out tokens · 40168 ms · 2026-05-19T12:12:59.750109+00:00 · methodology
