The Galaxy's Guide to the Tokenizer: A Benchmark for Scientific Foundation Models
Pith reviewed 2026-06-25 20:28 UTC · model grok-4.3
The pith
No single tokenization strategy excels at both reconstructing galaxy images and predicting their physical properties.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When four tokenization methods are plugged into the same transformer backbone and tested on galaxy image reconstruction plus physical property prediction, reconstruction quality and representation quality turn out to be decoupled, and no tokenizer dominates every task.
What carries the argument
A shared AstroPT transformer backbone that processes galaxy images tokenized by Affine, AIM, JetFormer, or VQ-VAE, evaluated on both reconstruction error and downstream physical property probes.
Load-bearing premise
That the shared AstroPT backbone and the chosen physical-property prediction tasks provide a fair, unbiased comparison of the four tokenization strategies without method-specific confounding effects.
What would settle it
Finding that one of the four tokenizers achieves both the highest reconstruction fidelity and the best physical property prediction accuracy across the full set of galaxy images.
Figures
read the original abstract
Tokenization is central to adapting scientific data for transformer-based foundation models, yet its impact on learned representations remains poorly understood. We compare four tokenization strategies, Affine, AIM, JetFormer, and VQ-VAE, within a unified transformer framework for astronomical imaging. Using 640,000 galaxy images from the DESI Legacy Survey and a shared AstroPT backbone, we evaluate each method on reconstruction fidelity and prediction of physical properties. Our results reveal trade-offs across approaches. The flow-based JetFormer achieves higher reconstruction quality, while VQ-VAE yields strong probe performance for galaxy physical properties. Affine and AIM better preserve localized morphological information. We find that reconstruction and representation quality are decoupled, and no single method consistently performs best across the tasks considered here. By grounding our evaluation in independently measured physical quantities, we hope this study serves to highlight the potential of scientific data as a basis for constructing interpretable benchmarks for foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper compares four tokenization strategies (Affine, AIM, JetFormer, VQ-VAE) for astronomical imaging within a unified transformer framework using 640,000 galaxy images from the DESI Legacy Survey and a shared AstroPT backbone. It evaluates each on reconstruction fidelity and prediction of physical properties, reports trade-offs (e.g., JetFormer higher reconstruction, VQ-VAE stronger probe performance, Affine/AIM better for morphology), and concludes that reconstruction and representation quality are decoupled with no single method best across tasks. The evaluation is grounded in independently measured physical quantities to create interpretable benchmarks for scientific foundation models.
Significance. If the decoupling result holds under a properly controlled comparison, the work supplies a concrete, physics-grounded benchmark for tokenizer choice in scientific transformers. The explicit use of independently measured physical properties as probes, rather than proxy metrics, is a clear strength that could help move the field toward falsifiable, domain-specific evaluations of foundation-model components.
major comments (2)
- [Methods] Methods (shared AstroPT backbone description): the claim that differences in reconstruction vs. physical-property probe performance reflect tokenizer properties alone is load-bearing for the decoupling conclusion, yet the text does not specify whether the shared backbone weights are re-initialized or re-trained from scratch for each tokenizer or whether a single pre-trained checkpoint is reused. If the latter, performance gaps could arise from tokenizer–backbone mismatch rather than intrinsic tokenizer quality, undermining the central empirical comparison.
- [Results] Results (probe-task evaluation): the abstract states that VQ-VAE yields strong probe performance and that reconstruction and representation quality are decoupled, but no error bars, statistical significance tests, or data-selection criteria for the 640k images are reported. Without these, it is impossible to assess whether the observed trade-offs are robust or whether the decoupling claim is supported by the data.
minor comments (2)
- [Abstract] Abstract: the sentence 'no single method consistently performs best across the tasks considered here' would be clearer if the specific tasks (reconstruction, morphology preservation, physical-property prediction) were enumerated.
- The manuscript would benefit from a table summarizing the four tokenizers' key hyperparameters and training schedules to allow direct comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important points for clarification. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Methods] Methods (shared AstroPT backbone description): the claim that differences in reconstruction vs. physical-property probe performance reflect tokenizer properties alone is load-bearing for the decoupling conclusion, yet the text does not specify whether the shared backbone weights are re-initialized or re-trained from scratch for each tokenizer or whether a single pre-trained checkpoint is reused. If the latter, performance gaps could arise from tokenizer–backbone mismatch rather than intrinsic tokenizer quality, undermining the central empirical comparison.
Authors: We thank the referee for this observation. The shared AstroPT backbone was implemented by using identical architecture and hyperparameters for each tokenizer, with the backbone weights randomly initialized and trained from scratch independently in each case. This design ensures that observed differences can be attributed to the tokenizers rather than to a pre-trained checkpoint mismatch. We will revise the Methods section to explicitly describe the initialization, training procedure, and hyperparameter sharing to remove any ambiguity. revision: yes
-
Referee: [Results] Results (probe-task evaluation): the abstract states that VQ-VAE yields strong probe performance and that reconstruction and representation quality are decoupled, but no error bars, statistical significance tests, or data-selection criteria for the 640k images are reported. Without these, it is impossible to assess whether the observed trade-offs are robust or whether the decoupling claim is supported by the data.
Authors: We agree that error bars, significance testing, and explicit data-selection criteria are necessary to substantiate the robustness of the reported trade-offs and decoupling result. In the revised manuscript we will add bootstrap-derived error bars to all probe-task metrics, include statistical significance tests (e.g., paired t-tests) comparing tokenizer performances, and provide a clear description of the selection criteria applied to the 640,000 DESI Legacy Survey images. These additions will directly support the claims in the abstract and results. revision: yes
Circularity Check
No circularity: empirical benchmark with independent physical-property ground truth
full rationale
The paper reports an empirical comparison of four tokenizers (Affine, AIM, JetFormer, VQ-VAE) on 640k DESI galaxy images using a shared AstroPT backbone. Performance is measured on reconstruction fidelity and prediction of independently measured physical properties. No equations, parameter fits, or derivations appear; the decoupling claim is an observed outcome across tasks rather than a quantity defined in terms of itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The setup is self-contained against external benchmarks (real galaxy properties), satisfying the criteria for score 0.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Amiaux, J., Scaramella, R., Mellier, Y ., Altieri, B., Burig- ana, C., Da Silva, A., Gomez, P., Hoar, J., Laureijs, R., Maiorano, E., Magalh˜aes Oliveira, D., Renk, F., Saavedra Criado, G., Tereno, I., Augu`eres, J. L., Brinchmann, J., Cropper, M., Duvet, L., Ealet, A., Franzetti, P., Garilli, B., Gondoin, P., Guzzo, L., Hoekstra, H., Holmes, R., Jahnke, ...
2012
-
[2]
doi: 10.1117/12.926513. Black Forest Labs. FLUX.2: Analyzing and enhanc- ing the latent space of FLUX – representation compar- ison,
-
[4]
ISSN 1538-3881. doi: 10.3847/1538-3881/ab089d. El-Nouby, A., Klein, M., Zhai, S., Bautista, M. A., Toshev, A., Shankar, V ., Susskind, J. M., and Joulin, A. Scal- able Pre-training of Large Autoregressive Image Models. ArXiv e-prints,
-
[5]
Esser, P., Rombach, R., and Ommer, B
doi: 10.48550/arXiv.2401.08541. Esser, P., Rombach, R., and Ommer, B. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873–12883,
-
[6]
He, K., Zhang, X., Ren, S., and Sun, J
doi: 10.48550/arXiv.2503.15312. He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition.arXiv,
-
[7]
He, K., Zhang, X., Ren, S., and Sun, J
48550/arXiv.1512.03385. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- ing for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778,
-
[8]
URL https: //doi.org/10.3847/1538-4357/ae38b8
doi: 10.3847/1538-4357/ae38b8. URL https: //doi.org/10.3847/1538-4357/ae38b8. Ivezi´c, ˇZ., Kahn, S. M., Tyson, J. A., Abel, B., Acosta, E., Allsman, R., Alonso, D., AlSayyad, Y ., Anderson, S. F., Andrew, J., et al. LSST: From Science Drivers to Reference Design and Anticipated Data Products.The Astrophysical Journal, 873:111,
- [9]
-
[10]
ISSN 0035-8711. doi: 10.1093/mnras/stad3015. Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization.ArXiv e-prints,
-
[11]
McInnes, L., Healy, J., and Melville, J
doi: 10.48550/ arXiv.1711.05101. McInnes, L., Healy, J., and Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.ArXiv e-prints,
-
[12]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
doi: 10.48550/arXiv. 1802.03426. Mousavi, P., Maimon, G., Moumen, A., Petermann, D., Shi, J., Wu, H., Yang, H., Kuznetsova, A., Ploujnikov, A., Marxer, R., Ramabhadran, B., Elizalde, B., Lugosch, L., Li, J., Subakan, C., Woodland, P., Kim, M., Lee, H.-y., Watanabe, S., Adi, Y ., and Ravanelli, M. Discrete Audio Tokens: More Than a Survey!ArXiv e-prints,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[13]
doi: 10.48550/arXiv.2506.10274. Parker, L., Lanusse, F., Shen, J., Liu, O., Hehir, T., Sarra, L., Meyer, L., Bowles, M., Wagner-Carena, S., Qu, H., Golkar, S., Bietti, A., Bourfoune, H., Casserau, N., Cornette, P., Hirashima, K., Krawezik, G., Ohana, R., Lourie, N., McCabe, M., Morel, R., Mukhopadhyay, P., Pettee, M., Blancard, B. R.-S., Cho, K., Cranmer,...
-
[14]
doi: 10.48550/arXiv.2510.17960. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space.London, Edinburgh, and Dublin Philosophical Magazine and Journal of Sci- ence, 2(11):559–572,
-
[15]
ISSN 1941-5982. doi: 10.1080/14786440109462720. Radford, A., Wu, J., Child, R., Luan, D., A., D., and Sutskever, I. Language models are unsuper- vised multitask learners.OpenAI Whitepaper,
-
[16]
ISSN 1538-3881. doi: 10.3847/1538-3881/acb213. Sanjaripour, S., Hemmati, S., Mobasher, B., Canalizo, G., Barish, B. C., Shivaei, I., Coil, A. L., Chartab, N., Jafariyazani, M., Reddy, N. A., and Azadi, M. The application of manifold learning to a selection of dif- ferent galaxy populations and scaling relation analy- sis.The Astrophysical Journal, 977(2):202, dec
-
[17]
doi: 10.3847/1538-4357/ad90ba. URL https://doi. org/10.3847/1538-4357/ad90ba. Sanjaripour, S., Aravindan, A., Canalizo, G., Hemmati, S., Mobasher, B., Coil, A. L., and Barish, B. C. Selec- tion of dwarf galaxies hosting active galactic nuclei: A measure of bias and contamination using unsupervised machine learning techniques.The Astrophysical Journal, 992...
-
[18]
URL https://doi.org/10.3847/1538-4357/ ae0326
doi: 10.3847/1538-4357/ae0326. URL https://doi.org/10.3847/1538-4357/ ae0326. Smith, M. J. and Geach, J. E. Astronomia ex machina: a his- tory, primer and outlook on neural networks in astronomy. R. Soc. Open Sci., 10(5):221454,
-
[19]
ISSN 2054-5703. doi: 10.1098/rsos.221454. Smith, M. J., Roberts, R. J., Angeloudi, E., and Huertas- Company, M. AstroPT: Scaling Large Observation Models for Astronomy.ArXiv e-prints,
-
[20]
Strubell, E., Ganesh, A., and Mccallum, A
doi: 10.48550/arXiv.2405.14930. Strubell, E., Ganesh, A., and Mccallum, A. Energy and Policy Considerations for Deep Learning in NLP.ACL Anthology, pp. 3645–3650,
-
[21]
CoLLaVO: Crayon large language and vision mOdel
doi: 10.18653/v1/ P19-1355. The Multimodal Universe Collaboration. The Multimodal Universe: Enabling Large-Scale Machine Learning with 100 TB of Astronomical Scientific Data.Advances in Neu- ral Information Processing Systems, 37:57841–57913,
-
[22]
doi: 10.48550/arXiv.2411. 19722. van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neu- ral discrete representation learning. InNeural Informa- tion Processing Systems,
-
[23]
doi: 10.48550/arXiv.1706. 03762. Walmsley, M., G´eron, T., Kruk, S., Scaife, A. M. M., Lintott, C., Masters, K. L., Dawson, J. M., Dickinson, H., Fortson, L., Garland, I. L., et al. Galaxy Zoo DESI: Detailed morphology measurements for 8.7M galaxies in the DESI Legacy Imaging Surveys.Monthly Notices of the Royal Astronomical Society, 526(3):4768–4786,
-
[24]
Vector-quantized Image Modeling with Improved VQGAN
ISSN 0035-8711. doi: 10.1093/mnras/stad2919. Yu, J., Li, X., Koh, J. Y ., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y ., Baldridge, J., and Wu, Y . Vector-quantized image modeling with improved vqgan.arXiv preprint arXiv:2110.04627,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/mnras/stad2919
-
[25]
Embedding Structure via PCA and UMAP We project the 768-dimensional embeddings onto two dimensions using PCA and UMAP analysis (Pearson, 1901; McInnes et al., 2018)
8 The Galaxy’s Guide to the Tokenizer A. Embedding Structure via PCA and UMAP We project the 768-dimensional embeddings onto two dimensions using PCA and UMAP analysis (Pearson, 1901; McInnes et al., 2018). The resulting projections are colour-coded by photometric redshift, g−r colour, r-band magnitude, and smoothness fraction as measured by Galaxy Zoo Ci...
1901
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.