Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

Ali Luo; Dongbin Zhao; Hailing Lu; Linjing Li; Minghui Jia; Qichao Zhang; Shuo Ye; Wen Hou

arxiv: 2601.06498 · v3 · submitted 2026-01-10 · 💻 cs.CL · astro-ph.IM

Minghui Jia, Qichao Zhang, Ali Luo, Linjing Li, Shuo Ye, Hailing Lu, Wen Hou, Dongbin Zhao

Pith reviewed 2026-05-16 15:25 UTC · model grok-4.3

classification 💻 cs.CL astro-ph.IM

keywords vision language agent · spectral inspection · rare celestial objects · tool augmentation · chain of thought reasoning · LAMOST · reinforcement learning · astronomical surveys

The pith

Spec-o3 is a vision-language agent that automates expert spectral inspection to identify rare celestial objects, raising macro-F1 from 28.3 to 76.5 on LAMOST tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spec-o3, a tool-augmented vision-language agent designed to replicate the way astronomers use specialized tools to inspect spectra and vet rare object candidates. Deep learning classifiers struggle with generalization and interpretability, so final vetting remains a manual expert process that cannot keep pace with the data volume from modern surveys. Spec-o3 addresses this by training first through supervised fine-tuning on expert inspection trajectories and then with outcome-based reinforcement learning, enabling interleaved multimodal chain-of-thought reasoning. On five rare-object tasks from LAMOST it achieves state-of-the-art performance with a 7B model while generalizing to SDSS and DESI data and producing physically consistent reasoning traces. This approach could make large-scale catalog construction feasible by removing the expert bottleneck.

Core claim

Spec-o3 is a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. It is trained with a two-stage post-training recipe consisting of cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, it establishes a new state of the art by boosting the macro-F1 score from 28.3 to 76.5 with a 7B parameter base model, outperforming both proprietary vision-language models and specialized deep models. The agent shows strong generalization to unseen inspection tasks.
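Since the headline number is a macro-F1 score, a minimal sketch of the textbook definition may help: F1 is computed per class and then averaged with equal weight, so a rare class counts as much as a common one. The abstract does not say whether the paper averages over classes within a task or over the five tasks, so the function below is an assumption about the metric's shape, not the authors' evaluation code. It returns a value in [0, 1], whereas the paper reports scores on a 0 to 100 scale.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        raise ValueError("need at least one example")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
truth = ["rare", "rare", "common", "common", "common", "common"]
preds = ["rare", "common", "common", "common", "common", "rare"]
print(macro_f1(truth, preds))
```

Because every class gets equal weight, a classifier that ignores the rare class is penalized heavily, which is why the metric suits rare-object vetting.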

What carries the argument

Interleaved multimodal chain-of-thought reasoning that calls external spectral analysis tools and incorporates their results into the agent's decision process.
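The control flow behind such interleaving can be sketched as a loop in which a policy alternates between requesting a tool and reading the result, until it commits to a verdict. Everything below is hypothetical: the tool names `zoom` and `peak`, the step dictionary layout, and the scripted stand-in for the model are illustration only, and the paper's actual tool set and interface are not described in the material reviewed here. The real agent is a vision-language model that looks at rendered spectra, which this sketch replaces with plain (wavelength, flux) pairs.

```python
# Hypothetical tool registry; names and behaviour are assumptions for illustration.
def zoom(spectrum, lo, hi):
    """Return the (wavelength, flux) samples inside [lo, hi]."""
    return [(w, f) for w, f in spectrum if lo <= w <= hi]


def peak(spectrum):
    """Return the wavelength of the sample with maximum flux."""
    return max(spectrum, key=lambda s: s[1])[0]


TOOLS = {"zoom": zoom, "peak": peak}


def run_agent(policy, spectrum, max_steps=4):
    """Alternate policy steps and tool observations until the policy answers."""
    trace = []
    for _ in range(max_steps):
        step = policy(spectrum, trace)
        trace.append(step)
        if step["type"] == "answer":
            return step["label"], trace
        result = TOOLS[step["tool"]](spectrum, *step.get("args", ()))
        trace.append({"type": "observation", "result": result})
    return None, trace  # no verdict within the step budget


def scripted_policy(spectrum, trace):
    """Stand-in for the model: call `peak` once, then answer from the observation."""
    if not trace:
        return {"type": "tool", "tool": "peak", "thought": "locate the strongest feature"}
    peak_wavelength = trace[-1]["result"]
    # Illustrative rule only: flag spectra whose strongest feature sits near 6563 Angstrom.
    label = "candidate" if abs(peak_wavelength - 6563.0) < 5.0 else "reject"
    return {"type": "answer", "label": label, "thought": f"peak at {peak_wavelength}"}


toy_spectrum = [(6500.0, 1.0), (6563.0, 5.0), (6600.0, 1.1)]
label, trace = run_agent(scripted_policy, toy_spectrum)
print(label, len(trace))
```

The returned trace is the part an expert would review: it records each thought, tool call, and observation in order.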

Load-bearing premise

The expert inspection trajectories used for supervised fine-tuning accurately represent reliable vetting practices and the reinforcement learning produces robust generalization without overfitting to the training tasks.

What would settle it

Independent expert review of the agent's reasoning traces on a new set of rare-object candidates from an unseen survey, showing either macro-F1 below 50 or majority judgment of physical inconsistency, would falsify the reliability of the automated vetting.

Original abstract

Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on expert visual inspection, a manually intensive process. In this process, astronomers leverage specialized tools to analyze spectra and construct reliable catalogs. However, this practice has become the primary bottleneck, as it is fundamentally incapable of scaling with the data deluge from modern spectroscopic surveys. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. Spec-o3 is trained with a two-stage post-training recipe: cold-start supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, Spec-o3 establishes a new state of the art, boosting the macro-F1 score from 28.3 to 76.5 with a 7B parameter base model and outperforming both proprietary VLMs and specialized deep models. Crucially, the agent demonstrates strong generalization to unseen inspection tasks across survey shifts (from LAMOST to SDSS/DESI). Expert evaluations confirm that its reasoning traces are coherent and physically consistent, supporting transparent and trustworthy decision-making. Code, data, and models are available at https://github.com/Maxwell-Jia/spec-o3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Spec-o3 gets a big F1 lift on LAMOST rare-object tasks with its two-stage agent training, but the generalization claim to SDSS/DESI looks under-supported without clearer controls on task overlap.

The core takeaway is that Spec-o3 combines a 7B vision-language model with spectral analysis tools and trains it first on expert inspection traces then with outcome-based RL, producing a jump from 28.3 to 76.5 macro-F1 on five LAMOST rare-object tasks while beating both proprietary VLMs and specialized classifiers. The code and data release is a practical plus. Expert review of the reasoning traces as coherent and physically consistent is also useful for checking whether the outputs align with how astronomers actually work.

What stands out as new is the specific two-stage recipe applied to this bottleneck in survey data vetting rather than generic VLM prompting. The performance numbers are large enough to notice and the domain focus gives the work a clear use case.

The soft spots sit in the evaluation and the generalization story. The abstract does not spell out baseline details, statistical tests, or how the five tasks were chosen, so it is difficult to judge whether the gains are robust or partly driven by task-specific artifacts. The RL stage rewards only final correctness, which raises the usual risk that the model picks up shared patterns across the training tasks instead of learning general inspection logic. The claim of strong cross-survey generalization would carry more weight with explicit numbers on held-out object classes or a breakdown of performance on entirely new spectral features. Without those, the SDSS/DESI results could reflect distribution overlap rather than true robustness.

This paper is aimed at people building AI tools for large spectroscopic surveys who need something more interpretable than black-box classifiers. A reader working on agentic systems for scientific data would find the training approach and the astronomy application worth examining. It deserves peer review because the empirical lift is substantial and the workflow problem is real, even if the generalization evidence needs tightening in revision.
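The concern about rewarding only final correctness can be made concrete with a small sketch. It assumes the simplest possible outcome reward, 1 for a correct verdict and 0 otherwise; the paper's actual reward formulation is not given in the material reviewed here, and the trajectory layout is invented for illustration.

```python
def outcome_reward(trajectory, gold_label):
    """Score a trajectory by its final verdict only; intermediate steps are ignored.

    Assumed binary reward, not the paper's formulation.
    """
    final = trajectory.get("answer")  # None when the agent never committed
    return 1.0 if final is not None and final == gold_label else 0.0


# Two hypothetical trajectories with the same verdict but different reasoning.
careful = {
    "steps": ["inspect emission lines", "check continuum shape", "cross-check redshift"],
    "answer": "candidate",
}
shortcut = {
    "steps": ["guess from overall brightness"],
    "answer": "candidate",
}

print(outcome_reward(careful, "candidate"), outcome_reward(shortcut, "candidate"))
```

Both trajectories earn the same reward, so nothing in this signal by itself separates sound inspection logic from a shortcut that happens to work on the training tasks.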

Referee Report

3 major / 2 minor

Summary. The paper introduces Spec-o3, a tool-augmented vision-language agent for astronomer-aligned spectral inspection of rare celestial object candidates. It employs a two-stage post-training approach consisting of supervised fine-tuning on expert inspection trajectories followed by outcome-based reinforcement learning on rare-type verification tasks. Evaluated on five rare-object identification tasks from LAMOST, the 7B-parameter model achieves a new state-of-the-art macro-F1 score of 76.5 (up from 28.3), outperforming proprietary VLMs and specialized deep models, while demonstrating generalization to unseen tasks across survey shifts to SDSS and DESI; expert review confirms coherent and physically consistent reasoning traces.

Significance. If the empirical results and generalization claims hold under rigorous controls, the work would represent a meaningful advance in automating expert-level spectral vetting, directly addressing the scalability bottleneck in processing data from large spectroscopic surveys like LAMOST, SDSS, and DESI. The open release of code, data, and models further strengthens its potential impact on reproducible research in astroinformatics.

major comments (3)

[Evaluation section] Evaluation section (LAMOST tasks and cross-survey generalization): the manuscript reports a macro-F1 jump from 28.3 to 76.5 and strong generalization to SDSS/DESI but provides no quantitative breakdown of held-out tasks, fraction of RL training data overlapping the reported evaluation set, or performance on entirely new object classes. This information is required to distinguish robust inspection logic from memorization of task-specific patterns.
[Training recipe] Training recipe (two-stage SFT + outcome-based RL): the reward is described only as final-answer correctness on rare-type verification; without explicit details on the reward formulation, trajectory sampling, or regularization against spurious correlations (e.g., survey-specific artifacts), the risk of overfitting to the five LAMOST tasks cannot be assessed, undermining the generalization claim.
[Experimental details] Experimental details: the soundness assessment notes absence of methodology specifics, baseline implementations, statistical significance tests, and bias analysis; these omissions are load-bearing because the central SOTA and generalization claims rest on the reported performance numbers.
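The overlap figure requested in the first major comment is cheap to compute once each spectrum carries a stable identifier. The sketch below is one way to report it in both directions; the identifier strings are placeholders, and how the released data actually keys its spectra is an assumption.

```python
def overlap_report(train_ids, eval_ids):
    """Report how many items are shared between training and evaluation sets."""
    train_set, eval_set = set(train_ids), set(eval_ids)
    if not train_set or not eval_set:
        raise ValueError("both sets must be non-empty")
    shared = train_set & eval_set
    return {
        "shared": len(shared),
        "fraction_of_train": len(shared) / len(train_set),
        "fraction_of_eval": len(shared) / len(eval_set),
    }


# Placeholder identifiers, not real survey object IDs.
rl_training = ["obj-001", "obj-002", "obj-003", "obj-004"]
evaluation = ["obj-003", "obj-004", "obj-005"]
print(overlap_report(rl_training, evaluation))
```

A report of zero shared items would only rule out exact duplicates; repeat observations of the same object under different identifiers would need a positional cross-match instead.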

minor comments (2)

[Introduction] The abstract and introduction use the term 'tool-augmented' without a dedicated subsection enumerating the specific tools (e.g., spectral line identifiers, continuum fitters) and their integration into the interleaved multimodal CoT.
[Figures and Tables] Figure captions and table legends should explicitly state the number of runs, random seeds, and confidence intervals for the macro-F1 scores to allow direct comparison with prior work.
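What the second minor comment asks for can be sketched as a short summary over repeated runs. The scores below are invented placeholders, not results from the paper, and the interval uses a normal approximation, which is too narrow for a handful of seeds; a Student-t critical value would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize_runs(scores, confidence=0.95):
    """Mean, sample standard deviation, and a normal-approximation interval over runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(scores)
    spread = stdev(scores)
    half_width = z * spread / sqrt(len(scores))
    return {
        "n": len(scores),
        "mean": centre,
        "std": spread,
        "ci": (centre - half_width, centre + half_width),
    }


# Placeholder macro-F1 values for three hypothetical seeds.
print(summarize_runs([0.74, 0.76, 0.78]))
```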

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We believe the suggested clarifications will strengthen the manuscript's claims regarding the robustness of Spec-o3's performance and generalization. We address each major comment below and commit to revisions accordingly.

Referee: [Evaluation section] Evaluation section (LAMOST tasks and cross-survey generalization): the manuscript reports a macro-F1 jump from 28.3 to 76.5 and strong generalization to SDSS/DESI but provides no quantitative breakdown of held-out tasks, fraction of RL training data overlapping the reported evaluation set, or performance on entirely new object classes. This information is required to distinguish robust inspection logic from memorization of task-specific patterns.

Authors: We agree that providing a more detailed breakdown is essential to substantiate the generalization claims and rule out memorization. In the revised manuscript, we will include: (1) a quantitative breakdown of the held-out tasks, specifying the composition of evaluation sets; (2) the exact fraction of RL training data that overlaps with the evaluation sets; and (3) performance metrics on entirely new object classes not encountered during training. These additions will demonstrate that the performance gains arise from robust, astronomer-aligned reasoning rather than task-specific memorization. revision: yes
Referee: [Training recipe] Training recipe (two-stage SFT + outcome-based RL): the reward is described only as final-answer correctness on rare-type verification; without explicit details on the reward formulation, trajectory sampling, or regularization against spurious correlations (e.g., survey-specific artifacts), the risk of overfitting to the five LAMOST tasks cannot be assessed, undermining the generalization claim.

Authors: We acknowledge the need for greater transparency in the training recipe to allow proper assessment of overfitting risks. The revised manuscript will provide explicit details on the reward formulation (including how correctness is measured and any shaping), the trajectory sampling procedure during RL, and regularization methods employed to mitigate spurious correlations such as survey-specific artifacts. This will enable readers to better evaluate the robustness of the generalization to SDSS and DESI. revision: yes
Referee: [Experimental details] Experimental details: the soundness assessment notes absence of methodology specifics, baseline implementations, statistical significance tests, and bias analysis; these omissions are load-bearing because the central SOTA and generalization claims rest on the reported performance numbers.

Authors: We agree that these experimental details are critical for validating the SOTA and generalization claims. In the revision, we will expand the experimental section to include full methodology specifics, detailed descriptions of baseline implementations (including how proprietary VLMs and specialized deep models were evaluated), results of statistical significance tests (e.g., paired t-tests or McNemar's test with p-values), and a comprehensive bias analysis addressing potential confounders like survey artifacts or class imbalance. revision: yes
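For reference, McNemar's test compares two classifiers on the same examples using only the cases where exactly one of them is right. A minimal sketch of the exact (binomial) two-sided version follows; the function names and toy labels are illustrative, and nothing here reflects tests the authors have run.

```python
from math import comb


def discordant_counts(gold, preds_a, preds_b):
    """Count examples where exactly one of the two systems is correct."""
    only_a = sum(1 for g, a, b in zip(gold, preds_a, preds_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, preds_a, preds_b) if a != g and b == g)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(only_a, only_b)
    # Under the null, each discordant case favours either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels, invented for illustration.
gold = ["rare", "rare", "common", "common"]
baseline = ["common", "rare", "common", "rare"]
agent = ["rare", "rare", "common", "common"]
b, c = discordant_counts(gold, baseline, agent)
print(b, c, mcnemar_exact(b, c))
```

The test applies to paired per-example correctness, so it would need the baselines and the agent to be scored on identical evaluation items.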

Circularity Check

0 steps flagged

No circularity; empirical results rest on external evaluation rather than self-referential definitions or fits

The paper presents an empirical pipeline: supervised fine-tuning on expert trajectories followed by outcome-based RL, then reports macro-F1 on five LAMOST tasks plus claimed cross-survey generalization. No equations, uniqueness theorems, or self-citations are invoked that would make any reported performance equivalent to the training inputs by construction. The evaluation numbers are presented as measured outcomes on specified tasks, not as predictions derived tautologically from the same data or prior self-work. This is a standard non-circular empirical ML report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Limited information available from abstract; no specific free parameters or invented physical entities mentioned.

axioms (1)

domain assumption · Multimodal chain-of-thought reasoning can be effectively interleaved with tool use in vision-language models.
Core to the proposed method.

invented entities (1)

Spec-o3 agent · no independent evidence
purpose: To automate spectral inspection for rare objects.
The proposed system itself.

pith-pipeline@v0.9.0 · 5573 in / 1338 out tokens · 41927 ms · 2026-05-16T15:25:51.202830+00:00 · methodology
