pith · machine review for the scientific record

arxiv: 2604.24589 · v1 · submitted 2026-04-27 · 💻 cs.AI · astro-ph.GA · astro-ph.IM

Recognition: unknown

A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:38 UTC · model grok-4.3

classification 💻 cs.AI · astro-ph.GA · astro-ph.IM

keywords vision-language models · astronomical data analysis · benchmark evaluation · physical grounding · observational astronomy · multi-modal reasoning · AstroVLBench

The pith

Vision-language models underperform specialized astronomical tools because they fail to ground visual features in physical knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AstroVLBench with more than 4,100 expert-checked examples across five real-world astronomy tasks that use optical images, radio maps, light curves, spectra, and multi-wavelength data. It tests six leading vision-language models and shows they trail domain-specific methods on every task, with results varying sharply by data type. Ablation experiments demonstrate that models improve when prompts explain the underlying physics rather than simply describing visible features, and when raw numbers replace rendered plots. The authors conclude that correct final answers can still rest on imprecise physical reasoning, making accuracy alone unreliable for scientific work.

Core claim

Current vision-language models, even the strongest ones, substantially underperform domain-specialized methods on observational astronomy tasks spanning multiple modalities. Performance improves when models receive physical explanations of why features matter instead of only phenomenological descriptions of what to look for, and when one-dimensional measurements are supplied directly as tables rather than as images. Without explicit physical grounding, models can reach correct predictions from plausible visual cues while offering imprecise justifications, showing that accuracy by itself is not sufficient for trustworthy scientific use.

What carries the argument

Mechanistic ablations that separate attention to salient visual features from grounding those features in physical knowledge, tested via phenomenological versus physical prompts and plots versus numerical tables.
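The two ablation axes can be made concrete with a small sketch. Everything below is illustrative, not the paper's actual protocol: the prompt wording, the radio-morphology framing, and the function and field names are all assumptions.

```python
# Hypothetical sketch of the two ablation axes: phenomenological vs. physical
# prompts, and rendered-plot vs. numerical-table inputs. The wording and task
# framing are invented for illustration, not taken from AstroVLBench.

PHENOMENOLOGICAL = (
    "Look at the two radio lobes: are the brightest spots near the "
    "center or at the outer edges?"
)
PHYSICAL = (
    "Edge-brightened lobes arise where a still-relativistic jet terminates "
    "in a hotspot; core-brightened sources come from jets that decelerate "
    "near the nucleus. Classify based on where the peak surface "
    "brightness lies and why."
)

def build_condition(prompt_style: str, input_form: str, measurements):
    """Assemble one evaluation cell of the 2x2 (prompt, input) ablation."""
    prompt = PHENOMENOLOGICAL if prompt_style == "phenomenological" else PHYSICAL
    if input_form == "table":
        # Serialize the 1-D measurements as plain numbers instead of a plot.
        body = "\n".join(f"{t:.3f}\t{f:.4f}" for t, f in measurements)
        return {"prompt": prompt, "text_input": body, "image_input": None}
    # Otherwise the same measurements would first be rendered to an image.
    return {"prompt": prompt, "text_input": None, "image_input": "plot.png"}

cell = build_condition("physical", "table", [(0.0, 1.02), (1.0, 0.87)])
```

Crossing the two axes gives four conditions per task, which is what lets the review separate "directing attention" effects from "physical grounding" effects.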

If this is right

  • Physical prompts reduce class-specific bias and improve balanced performance compared with purely descriptive prompts.
  • Supplying raw numerical data instead of rendered plots raises accuracy by as much as 13 percentage points.
  • Models can arrive at correct answers through visually plausible but physically imprecise routes, limiting safe deployment.
  • Task-specific strengths vary across models, so no single model dominates every astronomical modality.
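The "reduced class-specific bias" claim above is usually quantified via per-class recall and its mean (balanced accuracy). A minimal sketch, with a hypothetical AGN/Galaxy example (the labels and predictions are invented, not the paper's data):

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall for each class; a large spread signals class-specific bias."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; insensitive to class imbalance."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Illustrative: raw accuracy is 5/6 ≈ 0.83, but the minority AGN class
# has recall 0.5, so balanced accuracy is only (0.5 + 1.0) / 2 = 0.75.
y_true = ["AGN", "AGN", "Galaxy", "Galaxy", "Galaxy", "Galaxy"]
y_pred = ["AGN", "Galaxy", "Galaxy", "Galaxy", "Galaxy", "Galaxy"]
```

A prompt that raises balanced accuracy while leaving raw accuracy flat is exactly the "more balanced classifications" pattern the abstract attributes to physical prompts.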

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models may need built-in access to physical simulators or knowledge graphs to handle astronomy reliably.
  • The same grounding gap is likely to appear in other observational sciences that combine images with quantitative measurements.
  • Benchmarks like this could be extended with harder reasoning chains that require chaining multiple physical principles.

Load-bearing premise

The five tasks and 4,100 expert-verified instances capture the full range of observational astronomical reasoning that scientists actually perform.

What would settle it

A vision-language model that matches or exceeds the accuracy of specialized methods on all five tasks while also producing step-by-step justifications that match expert physical reasoning would falsify the claim.

Original abstract

Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AstroVLBench, a benchmark of over 4,100 expert-verified instances spanning five observational astronomy tasks (optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy). It evaluates six frontier VLMs, reports modality-dependent performance with all models substantially underperforming domain-specialized methods, and uses mechanistic ablations to show that physical grounding prompts and direct numerical tables (vs. plots) improve accuracy and reduce bias, concluding that accuracy alone is insufficient for trustworthy scientific deployment.

Significance. If the benchmark holds as a faithful proxy, the work supplies the first systematic multi-modal baselines for VLMs in observational astronomy and isolates concrete bottlenecks in visual attention, physical knowledge integration, and reasoning precision. The ablations (phenomenological vs. physical prompts; plot vs. table inputs) and direct comparisons to specialized pipelines provide actionable evidence that could inform prompt design, fine-tuning, and deployment decisions in scientific AI applications.

major comments (2)
  1. The central claim that VLMs substantially underperform specialized methods and that physical grounding is required for trustworthy use rests on AstroVLBench serving as a representative proxy. The five tasks cover key modalities with expert verification, yet the manuscript does not explicitly address whether rarer integrative patterns (cross-instrument fusion, systematic artifact handling, or hypothesis-driven follow-up) are adequately sampled; if underrepresented, the observed gaps and the 'accuracy alone is insufficient' conclusion could be narrower than stated.
  2. The abstract states that presenting one-dimensional measurements as numerical tables yields 'up to 13 percentage points improvement.' The specific task(s), model(s), and statistical details supporting this figure (including variance across runs or instances) are needed to evaluate how load-bearing this result is for the mechanistic claim about representation bottlenecks.
minor comments (2)
  1. Clarify in the methods section how the domain-specialized baselines were selected and implemented for each task to ensure the performance gaps are directly comparable.
  2. Add a short limitations paragraph discussing the scope of the five tasks relative to the broader space of observational reasoning workflows.
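The variance question raised in major comment 2 is commonly answered with a paired bootstrap over benchmark instances. A minimal sketch of how such a confidence interval could be computed (the function name and interface are assumptions, not the paper's code):

```python
import random

def bootstrap_diff_ci(table_correct, plot_correct, n_boot=10_000, seed=0):
    """Paired bootstrap 95% CI for the accuracy gain of table over plot inputs.

    table_correct / plot_correct: per-instance 0/1 correctness on the same
    benchmark items under the two input conditions (paired by item).
    """
    rng = random.Random(seed)
    n = len(table_correct)
    diffs = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(table_correct[i] - plot_correct[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If a per-task, per-model table of such intervals accompanied the "up to 13 percentage points" figure, readers could judge directly how load-bearing that single maximum is.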

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will incorporate the requested clarifications to strengthen the manuscript.

Point-by-point responses
  1. Referee: The central claim that VLMs substantially underperform specialized methods and that physical grounding is required for trustworthy use rests on AstroVLBench serving as a representative proxy. The five tasks cover key modalities with expert verification, yet the manuscript does not explicitly address whether rarer integrative patterns (cross-instrument fusion, systematic artifact handling, or hypothesis-driven follow-up) are adequately sampled; if underrepresented, the observed gaps and the 'accuracy alone is insufficient' conclusion could be narrower than stated.

    Authors: We agree that an explicit discussion of the benchmark's scope would improve the manuscript. AstroVLBench was designed to provide systematic, expert-verified coverage of five core observational modalities, but as noted, it does not explicitly include rarer integrative patterns such as cross-instrument fusion, systematic artifact handling across instruments, or hypothesis-driven follow-up. In the revised version, we will add a dedicated paragraph in the Discussion section acknowledging this limitation, clarifying that the reported performance gaps and the conclusion that accuracy alone is insufficient apply to the evaluated tasks and modalities, and identifying broader integrative reasoning as an important direction for future work. revision: yes

  2. Referee: The abstract states that presenting one-dimensional measurements as numerical tables yields 'up to 13 percentage points improvement.' The specific task(s), model(s), and statistical details supporting this figure (including variance across runs or instances) are needed to evaluate how load-bearing this result is for the mechanistic claim about representation bottlenecks.

    Authors: We thank the referee for this request for greater transparency. The details underlying the 'up to 13 percentage points' figure are reported in the ablation studies (Section 4.3 and associated figures/tables), which compare plot versus table inputs across tasks and models. To address the concern, we will revise the abstract to specify the task and model achieving the maximum improvement and will add a supplementary table (with cross-reference in the main text) reporting per-model and per-task differences along with variance measures across instances and prompt variations. This will allow readers to better assess the robustness of the result for the mechanistic claims regarding representation bottlenecks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent results

Full rationale

This is a pure empirical benchmark paper that constructs AstroVLBench (4,100 expert-verified instances across five modalities) and reports direct model evaluations plus prompt ablations. No equations, parameter fits, or derivations are claimed; all performance numbers and mechanistic conclusions are obtained by running the six VLMs on held-out test instances and comparing against domain-specialized baselines. The representativeness concern raised by the referee is a question of external validity, not a reduction of any result to its own inputs by construction. No load-bearing self-citations, self-definitional loops, or fitted-input predictions exist in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the constructed benchmark tasks and expert verification process capture genuine astronomical reasoning demands without introducing selection bias toward model weaknesses.

axioms (1)
  • domain assumption Expert-verified instances accurately reflect real observational astronomy reasoning tasks across the five modalities
    Stated in the abstract as the basis for the 4,100 instances; no further validation details provided.

pith-pipeline@v0.9.0 · 5569 in / 1327 out tokens · 63252 ms · 2026-05-08T03:38:15.778733+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 35 canonical work pages · 4 internal anchors
