arXiv: 2512.01270 · v2 · submitted 2025-12-01 · astro-ph.IM · astro-ph.GA · astro-ph.SR

Egent: An Autonomous Agent for Equivalent Width Measurement

Yuan-Sen Ting, Serat Mahmud Saad, Fan Liu, Yuting Shen

Pith reviewed 2026-05-17 03:34 UTC · model grok-4.3

classification astro-ph.IM · astro-ph.GA · astro-ph.SR

keywords equivalent width measurement · Voigt profile fitting · LLM agent · stellar spectroscopy · quality control automation · raw flux spectra · iterative refinement

The pith

An autonomous agent matches human experts when measuring equivalent widths in raw stellar spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Egent as a system that pairs a built-from-scratch multi-Voigt fitting routine with large-language-model visual inspection and iterative refinement. The agent works directly on un-normalized flux spectra, using function calls to adjust windows, add blend components, alter continuum placement, and flag bad cases. Validation on 18,615 lines across 84 spectra yields a median absolute deviation of 5-7 mÅ against expert measurements with no per-spectrum corrections applied. Slopes near unity between automated and manual values indicate that residual differences arise mainly from continuum choices rather than fitting failures. The approach stores every Voigt parameter, continuum coefficient, and reasoning step so any fit can be reconstructed exactly.
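To make the reproducibility claim concrete, here is a minimal sketch of a per-line record that round-trips through JSON. The class and field names (`FitRecord`, `VoigtComponent`, `continuum_coeffs`, `reasoning`) and the numerical values are hypothetical illustrations, not the schema used in the Egent code base.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class VoigtComponent:
    center: float     # line center in Angstrom
    amplitude: float  # fractional depth
    sigma: float      # Gaussian width in Angstrom
    gamma: float      # Lorentzian width in Angstrom


@dataclass
class FitRecord:
    line_id: str
    window: tuple                 # (lambda_min, lambda_max) in Angstrom
    continuum_coeffs: list        # polynomial coefficients
    components: list              # list of VoigtComponent
    reasoning: list = field(default_factory=list)  # free-text agent notes
    flagged: bool = False


def save_record(record):
    """Serialize a record to a JSON string."""
    return json.dumps(asdict(record), sort_keys=True)


def load_record(text):
    """Rebuild a record, restoring the nested component objects."""
    raw = json.loads(text)
    raw["components"] = [VoigtComponent(**c) for c in raw["components"]]
    raw["window"] = tuple(raw["window"])
    return FitRecord(**raw)


# Hypothetical values, for illustration only.
record = FitRecord(
    line_id="Fe I 5242.49",
    window=(5239.5, 5245.5),
    continuum_coeffs=[1.0, 0.01],
    components=[VoigtComponent(5242.49, 0.35, 0.05, 0.02)],
    reasoning=["initial fit accepted"],
)
restored = load_record(save_record(record))
```

Because every fitted quantity lives in the record, the model curve can be re-evaluated later without re-running the fit or the LLM.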

Core claim

Egent integrates classical multi-Voigt profile fitting with LLM-driven quality control that confirms good fits, flags problematic lines, and occasionally rescues edge cases through tool-based refinement. The system requires no pre-normalized continua and operates on raw spectra at signal-to-noise ratios of 50-250. Direct comparison to manual expert measurements on 18,615 lines shows raw agreement at MAD of 5-7 mÅ, with per-spectrum slopes of 0.85-1.19 reflecting global continuum methodology differences rather than fitting inaccuracies. The LLM accepts roughly 60-65 percent of lines after refinement, flags 10-20 percent as problematic, and enables full reproducibility through stored parameters, continuum coefficients, and reasoning chains.
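The two headline statistics can be illustrated with a short sketch. It assumes that "MAD" means the median of the absolute differences |agent − expert| and that the slope is an ordinary least-squares fit of agent values against expert values; the summary does not spell out either convention, and the numbers below are made up.

```python
from statistics import median


def median_abs_deviation(agent_ew, expert_ew):
    """Median of |agent - expert|, in the units of the inputs (e.g. mA)."""
    return median(abs(a - e) for a, e in zip(agent_ew, expert_ew))


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Made-up example: the agent reads every line 10% stronger than the expert,
# as a global continuum offset might cause.
expert = [10.0, 20.0, 40.0, 80.0, 120.0]
agent = [11.0, 22.0, 44.0, 88.0, 132.0]

mad = median_abs_deviation(agent, expert)  # 4.0
slope = ols_slope(expert, agent)           # close to 1.1
```

In this toy case every individual fit is "perfect" up to a scale factor, yet the MAD is nonzero. That is the sense in which a slope away from unity can point to continuum placement rather than to failed fits.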

What carries the argument

LLM agent that inspects Voigt-profile fits visually and issues function calls to refine wavelength windows, add blends, adjust continua, or flag cases.
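A minimal sketch of how such function calls might be dispatched onto a fit state. The tool names (`adjust_window`, `add_blend`, `set_continuum_order`, `flag_line`) and the call format are assumptions for illustration; the actual Egent tool interface may differ. The example calls mirror the Figure 4 caption, where the window is narrowed from 6 Å to 3.9 Å and the continuum is switched to a quadratic.

```python
from dataclasses import dataclass, field


@dataclass
class FitState:
    window: float = 6.0        # full window width in Angstrom
    continuum_order: int = 1   # polynomial order of the continuum
    blend_centers: list = field(default_factory=list)
    flagged: bool = False
    log: list = field(default_factory=list)  # applied calls, in order


def adjust_window(state, width):
    state.window = width


def add_blend(state, center):
    state.blend_centers.append(center)


def set_continuum_order(state, order):
    state.continuum_order = order


def flag_line(state, reason):
    state.flagged = True


# Hypothetical tool registry: name -> handler.
TOOLS = {
    "adjust_window": adjust_window,
    "add_blend": add_blend,
    "set_continuum_order": set_continuum_order,
    "flag_line": flag_line,
}


def apply_tool_call(state, call):
    """Apply one tool call of the form {'name': ..., 'arguments': {...}}."""
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    TOOLS[name](state, **call.get("arguments", {}))
    state.log.append(call)  # keep the full call history for later audit
    return state


state = FitState()
for call in [
    {"name": "adjust_window", "arguments": {"width": 3.9}},
    {"name": "set_continuum_order", "arguments": {"order": 2}},
]:
    apply_tool_call(state, call)
```

Restricting the model to a fixed registry keeps every refinement auditable: the log alone is enough to replay what the agent changed.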

If this is right

Survey-scale equivalent-width catalogs become feasible by reducing months of expert labor to days of automated runs.
Every measurement remains exactly reproducible because full Voigt parameters, continuum coefficients, and reasoning chains are stored.
Smaller language models can perform the same task at low cost, reaching roughly 200 lines per dollar while preserving agreement with larger models.
Offline analysis is possible with a local model backend and no external dependencies beyond the fitting engine.
A web interface allows drag-and-drop processing of individual spectra without custom scripting.
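As a rough back-of-envelope check on the cost point above, the sketch below applies the quoted rate of roughly 200 lines per dollar to the validation sample sizes. Treating the rate as constant, and applying it to the full 18,615-line sample rather than only the lines processed with the smaller model, are simplifying assumptions.

```python
def estimated_cost_usd(n_lines, lines_per_dollar=200.0):
    """Approximate API cost for n_lines at a flat lines-per-dollar rate."""
    return n_lines / lines_per_dollar


full_sample = estimated_cost_usd(18615)  # about 93 USD
mini_subset = estimated_cost_usd(13913)  # about 70 USD
```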

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same agent architecture could be adapted to other line-strength or abundance measurements if the LLM is given analogous quality-control tasks.
Large-scale application across many instruments would likely expose systematic differences in continuum placement that are currently absorbed in the observed slopes.
Community testing on additional datasets could reveal whether the current validation set already captures the full range of edge cases the LLM might encounter.

Load-bearing premise

The language model can perform consistent visual quality control and iterative refinement across varying spectral conditions without introducing undetected systematic biases.

What would settle it

A blind comparison of Egent results against new expert measurements on spectra from a different instrument or at substantially lower signal-to-noise would show whether the 5-7 mÅ agreement persists.

Figures

Figures reproduced from arXiv: 2512.01270 by Fan Liu, Serat Mahmud Saad, Yuan-Sen Ting, Yuting Shen.

**Figure 1.** Raw Magellan/MIKE échelle spectrum illustrating challenges for automated EW measurement. Top panel: Multiple échelle orders (different colors) showing the characteristic blaze function, the instrumental response that modulates the observed flux with a curved envelope peaking near each order’s center. Absorption lines appear as narrow dips superimposed on this curved background. Bottom panel: Single order …

**Figure 2.** Schematic overview of the Egent pipeline. Input spectra (shifted to stellar rest frame with empirical wavelength calibration) and line catalogs enter the direct multi-Voigt fitting stage. Quality metrics determine whether the fit is acceptable or requires LLM inspection. The LLM visual inspector examines diagnostic plots and can adjust the extraction window ( …

**Figure 3.** Multi-Voigt fitting process demonstrated on Fe I 5242.49 Å (Gaia ID 55780840513067392, SNR ∼ 160). Top panel: The normalized spectrum (black) is fit simultaneously with 8 Voigt components (orange and green dashed lines indicate individual component centers; the target component is shown in green). The combined multi-Voigt model (red) reproduces the complex blend structure. Continuum normalization is perf…

**Figure 4.** Example of LLM-driven window and continuum adjustment for Na I 6154.23 Å. (a) Simple fit: The initial 6 Å window includes edge features (visible at 6156–6157 Å) that introduce systematic bias in the residuals. The yellow shaded region indicates where the LLM determined a narrower window would improve the fit. (b) Egent fit: After the agent narrowed the window to 3.9 Å and switched to a quadratic contin…

**Figure 5.** Example of LLM-driven peak identification for Ni I 5643.08 Å. (a) Direct fit: Nine Voigt components (blue shaded regions) were identified automatically from the line catalog. While these capture most absorption features, residuals at ∼5642.5 Å and ∼5645.8 Å show systematic deviations exceeding 2σ, indicating missed features. (b) Egent fit: By visually inspecting the diagnostic plot, the LLM identified t…

**Figure 6.** Before and after comparison of LLM-proposed continuum regions for Fe I 6127.91 Å. (a) Before: Automatic iterative continuum fitting struggles in this crowded region, resulting in a systematic offset between data and model (RMS = 12.9σ). The normalized flux appears above unity on the blue side while absorption features are incompletely modeled. (b) After: Based on visual inspection, the LLM identified thre…

**Figure 7.** Representative examples of flagged lines. Lines are flagged by the LLM during iterative fitting when it determines the fit cannot be salvaged: elevated RMS values indicate continuum failures, severe blends, or model-data mismatch. The flagging mechanism identifies cases where automated fitting produces unreliable measurements, preventing these outliers from contaminating abundance analyses. All flagging use…

**Figure 8.** 1-to-1 comparison of Egent vs C3PO EW measurements. Top row: GPT-5 results for three representative spectra. Bottom row: GPT-5-mini results. Blue filled circles: accepted measurements. Hollow red circles: lines flagged by the LLM. Dark line: best-fit linear regression on unflagged points only. Shaded bands: 5%, 10%, 15%, 20% deviation from 1:1. Flagged points cluster at larger deviations, validating the AI…

**Figure 9.** Distribution of correlation quality (R²) and systematic offset (slope) across all spectra with C3PO catalog matches, comparing GPT-5 (21 spectra) and GPT-5-mini (63 spectra). Spectra with poor wavelength calibration (R² < 0.8) were excluded from the analysis. (a) Both models achieve median R² ≈ 0.93, with all spectra above 0.82, indicating strong linear correlation between Egent and expert measurements. (…

**Figure 10.** Distribution of absolute EW deviations |Egent − C3PO| separated by flagging status for (a) GPT-5 (21 spectra, 4,702 lines) and (b) GPT-5-mini (63 spectra, 13,913 lines). Green: accepted lines (MAD = 5–7 mÅ). Red: flagged lines (MAD = 10–20 mÅ). Both models show the same pattern: flagged lines have systematically larger deviations from catalog values, confirming that the AI flagging effectively identifie…

**Figure 11.** Fitting quality across SNR levels for two representative lines: Sc II 5526.82 Å (red, top) and Ba II 5853.67 Å (green, bottom). Each column shows the same line observed in different spectra at SNR = 30–313. Top panels: normalized flux (dark) with Voigt model overlay. Bottom panels: normalized residuals with ±1σ (green) and ±2σ (blue) bands. Even at SNR = 30, the measured EW agrees with C3PO to within 3 …
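The Figure 5 caption treats residuals that stay beyond 2σ as a sign of missed features. A simple numerical analogue of that check is sketched below: it finds contiguous runs of normalized residuals above a threshold. The function name and the `min_length` parameter are assumptions; in Egent the judgment is made by the LLM from a diagnostic plot, and the exact criteria are not given here.

```python
def runs_beyond(residuals, threshold=2.0, min_length=3):
    """Return (start, end) index pairs, end exclusive, for every run of at
    least min_length consecutive points with |residual| > threshold.

    residuals are assumed to be normalized, i.e. (data - model) / sigma.
    """
    runs = []
    start = None
    for i, r in enumerate(residuals):
        if abs(r) > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the array.
    if start is not None and len(residuals) - start >= min_length:
        runs.append((start, len(residuals)))
    return runs


# Made-up normalized residuals: two sustained excursions and one isolated spike.
example = [0.1, -0.5, 2.5, 3.1, 2.8, 0.2, -2.4, 0.3, 2.2, 2.6, 2.9]
suspect = runs_beyond(example)  # [(2, 5), (8, 11)]
```

Requiring several consecutive points separates a coherent unmodeled feature from a single noisy pixel, which is expected to exceed 2σ about 5% of the time for Gaussian noise.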

Original abstract

We present Egent, an autonomous agent that combines classical multi-Voigt profile fitting with large language model (LLM) visual inspection and iterative refinement. The fitting engine is built from scratch with minimal dependencies, creating an ecosystem where the LLM can reason about fits through function calls: adjusting wavelength windows, adding blend components, modifying continuum treatment, and flagging problematic cases. Egent operates directly on raw flux spectra without requiring pre-normalized continua. We validate against manual measurements from human experts using 18,615 lines from the C3PO program across 84 Magellan/MIKE spectra at SNR ~ 50-250. The raw agreement between Egent and expert measurements is MAD = 5-7 mÅ, without any post-hoc per-spectrum correction. Per-spectrum slopes of ~0.85-1.19 around unity reflect differences in global continuum methodology rather than fitting failures. The LLM's primary role is quality control: it confirms good fits (~60-65% of lines are LLM-refined and accepted), flags problematic cases (~10-20%), and occasionally rescues edge cases where tool use improves fits. Agreement between GPT-5 and GPT-5-mini confirms reproducibility, with GPT-5-mini enabling low-cost analysis at ~200 lines per US dollar. Every fit stores complete Voigt parameters, continuum coefficients, and LLM reasoning chains, enabling exact reconstruction without re-running. Egent compresses what traditionally requires months of expert effort into days of automated analysis, enabling survey-scale EW measurement. We provide open-source code at https://github.com/tingyuansen/Egent, including a web interface for drag-and-drop analysis and a local LLM backend for fully offline operation on consumer hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Egent pairs a from-scratch Voigt fitter with LLM function calls for window, blend, and QC adjustments, delivering raw agreement of MAD 5-7 mÅ on 18,615 MIKE lines, but the single-instrument test set leaves generalization unproven.

The main thing to know is that Egent builds a practical autonomous pipeline for equivalent width work by wiring a custom multi-Voigt engine directly to an LLM that calls functions to shift windows, add blends, tweak continua, and flag bad cases on raw spectra. It reports MAD agreement of 5-7 mÅ against human experts on 18,615 lines from 84 MIKE spectra at SNR 50-250, with no post-hoc per-spectrum scaling applied. Per-spectrum slopes of 0.85-1.19 are attributed to continuum choices rather than fitting errors, which is a straightforward acknowledgment.

Referee Report

2 major / 3 minor

Summary. The manuscript presents Egent, an autonomous agent that integrates a from-scratch multi-Voigt profile fitting engine with LLM-driven visual inspection, iterative refinement via function calls (adjusting windows, blends, continuum), and quality control. It operates on raw flux spectra and is validated on 18,615 lines from 84 Magellan/MIKE spectra (C3PO program, SNR ~50-250), reporting raw MAD agreement of 5-7 mÅ with human experts without post-hoc corrections. Per-spectrum slopes of 0.85-1.19 are attributed to continuum differences; ~60-65% of lines are LLM-refined and accepted, ~10-20% flagged. The system stores full Voigt parameters, continuum coefficients, and reasoning chains for reproducibility. Open-source code, web interface, and local LLM backend are provided.

Significance. If the central claims hold, Egent could substantially accelerate equivalent-width measurements for large spectroscopic surveys by compressing months of expert labor into days of automated processing. Notable strengths include the large independent validation sample (18,615 lines), direct comparison to human experts rather than self-referential metrics, full storage of parameters and LLM reasoning for exact reconstruction, reproducibility between GPT-5 and GPT-5-mini, and open-source release with low-cost (~200 lines per USD) and offline options. These elements support practical adoption if generalization is demonstrated.

major comments (2)

[Validation and Results] Validation section: The reported raw MAD=5-7 mÅ agreement and ~10-20% flagging rate are demonstrated exclusively on Magellan/MIKE spectra at SNR 50-250. Because the LLM agent performs visual QC, iterative window/continuum/blend adjustments, and flagging through function calls, the absence of cross-instrument tests (different line-spread functions, telluric patterns, or noise regimes) leaves open the possibility of undetected instrument-specific or SNR-dependent systematics. This directly bears on the claim of enabling survey-scale EW measurement across instruments.
[Methods] Methods and abstract: Exact LLM prompt specifications, decision criteria for function calls (e.g., thresholds for adding blends or flagging), and quantitative error propagation from the Voigt fits are not provided. These details are load-bearing for evaluating whether the autonomous refinements introduce biases not captured in the current MIKE-only validation set.

minor comments (3)

[Abstract] The abstract states that per-spectrum slopes of 0.85-1.19 reflect continuum methodology differences; a brief quantitative illustration or reference to the relevant figure/table would improve clarity.
Consider adding a short description of how uncertainties on the final EW values are derived from the stored Voigt parameters and continuum coefficients.
Ensure all acronyms (e.g., C3PO, MIKE) are defined at first use and that figure captions explicitly label the meaning of slope values and flagging categories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed report. The comments identify important areas for strengthening the manuscript, particularly regarding generalization and methodological transparency. We address each major comment below and will incorporate revisions to improve clarity and completeness.

Referee: [Validation and Results] Validation section: The reported raw MAD=5-7 mÅ agreement and ~10-20% flagging rate are demonstrated exclusively on Magellan/MIKE spectra at SNR 50-250. Because the LLM agent performs visual QC, iterative window/continuum/blend adjustments, and flagging through function calls, the absence of cross-instrument tests (different line-spread functions, telluric patterns, or noise regimes) leaves open the possibility of undetected instrument-specific or SNR-dependent systematics. This directly bears on the claim of enabling survey-scale EW measurement across instruments.

Authors: We agree that the current validation is limited to a single instrument and SNR range, which is a genuine limitation for claims of broad survey applicability. The choice of the C3PO Magellan/MIKE dataset was driven by the availability of a large (18,615-line) independent expert comparison set that enables direct, uncorrected statistical assessment. In the revised manuscript we will add a new subsection in the Discussion explicitly addressing this limitation, including discussion of potential instrument-specific effects (e.g., line-spread function differences and telluric contamination) and a clear statement that cross-instrument validation remains future work. We will also qualify the abstract and conclusions to reflect that the demonstrated performance is for MIKE-like data while the underlying method is designed to be instrument-agnostic. revision: yes
Referee: [Methods] Methods and abstract: Exact LLM prompt specifications, decision criteria for function calls (e.g., thresholds for adding blends or flagging), and quantitative error propagation from the Voigt fits are not provided. These details are load-bearing for evaluating whether the autonomous refinements introduce biases not captured in the current MIKE-only validation set.

Authors: We concur that these details are essential for reproducibility and for assessing possible biases introduced by the LLM-driven refinements. The present manuscript describes the overall architecture but omits the precise prompts, thresholds, and error-propagation formalism. In the revised version we will add a new appendix containing (i) the complete LLM prompt templates, (ii) the explicit decision criteria and numerical thresholds used for function calls (e.g., blend-addition and flagging rules), and (iii) a quantitative description of uncertainty estimation and propagation from the multi-Voigt least-squares fits, including the covariance matrix returned by the fitting engine. revision: yes
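The uncertainty propagation promised in the response above can be illustrated with first-order (linear) error propagation from a fit covariance matrix. The sketch uses a single Gaussian component, whose equivalent width has the closed form A·σ·√(2π), as a stand-in for a Voigt component. This is a simplifying assumption for illustration and not the formalism the authors say they will add.

```python
import math


def gaussian_ew(amplitude, sigma):
    """EW of a Gaussian absorption line with fractional depth `amplitude`
    and width `sigma` (result in the units of sigma)."""
    return amplitude * sigma * math.sqrt(2.0 * math.pi)


def ew_uncertainty(amplitude, sigma, cov):
    """First-order uncertainty on the Gaussian EW.

    cov is the 2x2 covariance of (amplitude, sigma):
        [[var_A, cov_As], [cov_As, var_s]]
    The variance is J @ cov @ J^T with J = d(EW)/d(amplitude, sigma).
    """
    k = math.sqrt(2.0 * math.pi)
    jac = (k * sigma, k * amplitude)
    var = sum(jac[i] * cov[i][j] * jac[j] for i in range(2) for j in range(2))
    return math.sqrt(var)


# Made-up fit result: depth 0.30 +/- 0.01, sigma 0.050 +/- 0.001 Angstrom,
# with no correlation between the two parameters.
cov = [[1e-4, 0.0], [0.0, 1e-6]]
ew = gaussian_ew(0.30, 0.05)              # in Angstrom
ew_err = ew_uncertainty(0.30, 0.05, cov)  # in Angstrom
```

For a real multi-Voigt fit the Jacobian would also include the Lorentzian width and the continuum coefficients, and the off-diagonal covariance terms between blended components can dominate the error budget.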

Circularity Check

0 steps flagged

No circularity: validation is direct empirical comparison to independent human expert measurements

The paper describes an engineering tool (Egent) that combines classical Voigt fitting with LLM-driven iterative refinement and quality control. Its central result is an empirical validation metric (MAD=5-7 mÅ raw agreement on 18,615 lines) obtained by direct comparison against manual measurements performed by human experts on a fixed set of Magellan/MIKE spectra. No derivation chain, first-principles prediction, or fitted parameter is presented whose output reduces by construction to the input data or to a self-citation. The per-spectrum slope variations are explicitly attributed to continuum methodology differences rather than being treated as a derived prediction. No uniqueness theorems, ansatzes smuggled via prior self-work, or renaming of known results appear as load-bearing steps. The validation therefore remains externally falsifiable against the human reference set and does not collapse into tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard spectroscopic modeling assumptions and the ability of current LLMs to reason about fit quality; no new physical entities are postulated.

free parameters (1)

continuum polynomial coefficients
Fitted per spectrum as part of the model; not a global free parameter of the overall method.

axioms (1)

domain assumption: Stellar absorption lines can be adequately represented by sums of Voigt profiles plus a low-order polynomial continuum.
This is the foundational modeling choice of the fitting engine invoked throughout the validation.
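The modeling assumption above can be written down in a few lines. The sketch below substitutes a pseudo-Voigt (a weighted sum of a Gaussian and a Lorentzian sharing one FWHM) for the true Voigt profile so that it runs on the standard library alone. That substitution, the parameterization, and all function names are assumptions for illustration, not the Egent fitting engine.

```python
import math


def pseudo_voigt(x, center, fwhm, eta):
    """Unit-peak pseudo-Voigt: eta * Lorentzian + (1 - eta) * Gaussian."""
    d2 = (x - center) ** 2
    gauss = math.exp(-4.0 * math.log(2.0) * d2 / fwhm ** 2)
    lorentz = 1.0 / (1.0 + 4.0 * d2 / fwhm ** 2)
    return eta * lorentz + (1.0 - eta) * gauss


def continuum(x, coeffs, x0):
    """Polynomial continuum; coeffs[k] multiplies (x - x0)**k."""
    return sum(c * (x - x0) ** k for k, c in enumerate(coeffs))


def model_flux(x, coeffs, x0, components):
    """Continuum times (1 - summed fractional depth of all components).

    Each component is a tuple (center, amplitude, fwhm, eta).
    """
    depth = sum(a * pseudo_voigt(x, c, w, eta) for (c, a, w, eta) in components)
    return continuum(x, coeffs, x0) * (1.0 - depth)


def equivalent_width(component, lo, hi, n=2001):
    """Trapezoid integral of one component's fractional depth over [lo, hi]."""
    c, a, w, eta = component
    step = (hi - lo) / (n - 1)
    ys = [a * pseudo_voigt(lo + i * step, c, w, eta) for i in range(n)]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


# Made-up target line: depth 0.4, FWHM 0.2 Angstrom, purely Gaussian (eta = 0).
target = (5242.49, 0.4, 0.2, 0.0)
ew_angstrom = equivalent_width(target, 5239.49, 5245.49)
ew_milliangstrom = 1000.0 * ew_angstrom
```

Integrating only the target component, rather than the whole window, is what lets a blended line be measured separately from its neighbours. The continuum enters multiplicatively, so an error in its placement rescales the fitted depths, which ties in with the per-spectrum slopes discussed earlier.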

invented entities (1)

Egent autonomous agent (no independent evidence)
purpose: To orchestrate classical fitting and LLM reasoning for end-to-end equivalent width measurement
The agent is the primary contribution of the work; independent evidence is limited to the reported validation set.

pith-pipeline@v0.9.0 · 5620 in / 1421 out tokens · 76158 ms · 2026-05-17T03:34:08.478125+00:00 · methodology
