
arxiv: 2606.29997 · v1 · pith:NKF3BXRTnew · submitted 2026-06-29 · 💻 cs.CV

Rigel: Self-Distilled Score Adaptation for Image and Video Captioning Evaluation

Pith reviewed 2026-06-30 06:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords image captioning evaluation · video captioning evaluation · LLM-as-a-Judge · self-distillation · reference-free evaluation · human judgment alignment · multimodal metrics

The pith

Rigel distills an evaluation-specific scoring head from a frozen LLM to align caption metrics more closely with human judgments without large-vocabulary mismatches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Rigel as a new automatic metric for image and video captioning evaluation. Current metrics and LLM-as-a-Judge methods suffer from poor human alignment because language modeling operates over large vocabularies while evaluation uses small label sets. Rigel solves this by distilling a dedicated scoring head from a frozen LLM into a task-aligned space, then refining the backbone with human judgment data on a new Vid-Lepus dataset of video clips and captions. Experiments across benchmarks show Rigel beats prior metrics, with gains exceeding 10 points on ActivityNet-Fact in the reference-free case. Readers would care because more reliable automatic scores let developers benchmark and improve multimodal systems with less human annotation effort.

Core claim

Rigel introduces self-distilled score adaptation: an evaluation-specific scoring head is distilled from a frozen LLM to capture judgment signals directly in a task-aligned space, bypassing reliance on large-vocabulary token sets; the LLM backbone is then refined using human judgment data. The method is trained on the Vid-Lepus dataset of 3,338 video clips, 33,380 reference captions, and 5,637 candidate captions. On multiple benchmarks Rigel outperforms existing metrics and delivers over 10-point gains on ActivityNet-Fact under reference-free conditions.
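The vocabulary mismatch described above can be made concrete with a small sketch. It assumes, hypothetically, that a judge LLM's next-token logits are available and that five known token ids correspond to the labels "1" to "5"; the toy vocabulary, the ids, and the function names below are illustrative and are not taken from the paper, whose actual distillation target is not described in this summary.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_token_distribution(logits, score_token_ids):
    """Renormalise over the score tokens only, ordered "1" to "5"."""
    return softmax([logits[i] for i in score_token_ids])


def expected_score(probs):
    """Probability-weighted mean of the ordinal labels 1..len(probs)."""
    return sum((k + 1) * p for k, p in enumerate(probs))


# Hypothetical toy vocabulary of 8 tokens; ids 2..6 stand in for "1".."5".
toy_logits = [0.3, -1.2, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5]
score_ids = [2, 3, 4, 5, 6]

full_probs = softmax(toy_logits)
# Share of full-vocabulary probability that lands on the five labels.
score_mass = sum(full_probs[i] for i in score_ids)
label_probs = score_token_distribution(toy_logits, score_ids)

print("mass on score tokens:", round(score_mass, 3))
print("label distribution:", [round(p, 3) for p in label_probs])
print("expected score:", round(expected_score(label_probs), 3))
```

In this toy case most of the full-vocabulary probability sits on a non-label token, which is the kind of leakage a dedicated five-way head would avoid by construction.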

What carries the argument

The evaluation-specific scoring head distilled from the frozen LLM, which extracts judgment signals into a dedicated task-aligned space separate from full language modeling.
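The Figure 3 caption states that the scoring head is trained over five labels with Earth Mover's Distance (EMD). The sketch below shows one common form of EMD for ordinal labels, the sum of absolute differences between cumulative distributions with unit spacing between adjacent labels. Whether the paper uses this exact form (rather than, for example, a squared variant) is an assumption, and `emd_1d` is a hypothetical helper name.

```python
def emd_1d(p, q):
    """EMD between two distributions over the same ordered labels.

    Assumes unit distance between adjacent labels, so the cost equals
    the sum of absolute differences between the two running CDFs.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cdf_gap = 0.0
    cost = 0.0
    for p_k, q_k in zip(p, q):
        cdf_gap += p_k - q_k   # running CDF(p) - CDF(q)
        cost += abs(cdf_gap)
    return cost


# Hypothetical teacher target: all mass on label "1".
teacher = [1.0, 0.0, 0.0, 0.0, 0.0]
near_miss = [0.0, 1.0, 0.0, 0.0, 0.0]   # head predicts "2"
far_miss = [0.0, 0.0, 0.0, 0.0, 1.0]    # head predicts "5"

print(emd_1d(teacher, teacher))    # 0.0
print(emd_1d(teacher, near_miss))  # 1.0
print(emd_1d(teacher, far_miss))   # 4.0
```

Unlike a plain cross-entropy over five classes, which would treat both misses identically, this loss grows with the ordinal distance between the predicted and target labels.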

If this is right

  • Rigel achieves higher correlation with human judgments than prior metrics across image and video captioning benchmarks.
  • The approach yields particularly large gains in reference-free evaluation where no reference captions are supplied.
  • A new Vid-Lepus dataset is provided that pairs video clips with multiple reference and candidate captions for metric training.
  • Refining the LLM backbone on human judgment data further improves alignment after the initial distillation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could be tested on other generation tasks such as visual question answering or story generation to check whether task-specific heads generalize.
  • If the scoring head proves stable, captioning models might be trained end-to-end by back-propagating through Rigel scores rather than cross-entropy on references.
  • Reference-free evaluation becoming stronger would reduce dependence on expensive reference caption collections during model development.

Load-bearing premise

The distilled scoring head from the frozen LLM successfully isolates human judgment signals in its own space without needing the original large vocabulary.

What would settle it

On a held-out benchmark such as ActivityNet-Fact, Rigel fails to exceed prior metrics by a statistically significant margin in correlation with human ratings under reference-free conditions.
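A test of this kind could be sketched as follows. The example computes Kendall's tau-a between human ratings and each metric, then a paired bootstrap interval for the gap between two metrics. The choice of tau-a, the bootstrap procedure, and all data below are assumptions for illustration; the rebuttal mentions only "paired tests" without naming the statistic.

```python
import random


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def bootstrap_tau_gap(human, metric_a, metric_b, n_resamples=200, seed=0):
    """95% percentile interval for tau(human, a) - tau(human, b).

    The same resampled items are used for both metrics (paired design).
    """
    rng = random.Random(seed)
    n = len(human)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [metric_a[i] for i in idx]
        b = [metric_b[i] for i in idx]
        gaps.append(kendall_tau_a(h, a) - kendall_tau_a(h, b))
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]


# Hypothetical toy data: human ratings and two metrics' scores per caption.
human = [1, 2, 2, 3, 4, 4, 5, 5, 3, 1]
new_metric = [0.1, 0.3, 0.2, 0.5, 0.8, 0.7, 0.9, 1.0, 0.6, 0.2]
old_metric = [0.5, 0.1, 0.6, 0.4, 0.3, 0.9, 0.2, 0.8, 0.7, 0.4]

low, high = bootstrap_tau_gap(human, new_metric, old_metric)
print("95% interval for the tau gap:", round(low, 3), round(high, 3))
```

If the interval for the gap includes zero on the held-out benchmark, the claimed advantage would not be distinguishable from resampling noise under these assumptions.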

Figures

Figures reproduced from arXiv: 2606.29997 by Daichi Yashima, Kazuki Matsuda, Komei Sugiura, Shinnosuke Hirano, Shuitsu Koyama, Yuiga Wada.

Figure 1: Overview of RIGEL. A two-phase framework for human-aligned caption evaluation. In Phase 1, an evaluation-specific scoring head is distilled from a frozen large language model (LLM) to map hidden representations to ordinal judgment scores, alleviating the mismatch between the LM vocabulary and the ordinal label set in the original language modeling (LM) head. In Phase 2, the LLM backbone is refined using …
Figure 2: Logit distributions over score tokens (“1”–“5”)
Figure 3: Overview of our proposed two-phase training framework. (i) Scoring head (red block) is trained with five labels using Earth Mover’s Distance (EMD) while the LLM and the LM head are frozen. (ii) The LLM backbone is fine-tuned using human judgments while freezing the scoring head’s parameters. CE represents cross-entropy. … approaches have been proposed for image captioning. For example, FLEUR (Lee et al., 20…
Figure 4: Qualitative results on the Nebula dataset. Cases (a)–(b) illustrate successful examples in the reference-based setting, whereas (c) shows a successful example in the reference-free setting. In contrast, (d) represents a failure case in the reference-free setting. Green values indicate predictions closest to human annotations, and red values denote critical errors. “-” indicates that no reference caption wa…
Figure 5: Examples of successful cases from the VATEX-EVAL dataset. Case (a) illustrates a successful example in the reference-based setting, whereas (b) shows a successful example in the reference-free setting. … on the Nebula dataset. We used Nebula for the qualitative analysis because it is a diverse and balanced dataset (Matsuda et al., 2024). Cases (a) and (b) show successful cases in the reference-based setting…
Figure 6: The logit distribution over score tokens
Original abstract

Automatic evaluation of image and video captioning is essential for benchmarking multimodal systems, although standard evaluation metrics show limited alignment with human judgments. Recent approaches using large language models (LLMs), commonly referred to as LLM-as-a-Judge, have improved alignment with human judgments but still suffer from a mismatch between large-vocabulary language modeling and evaluation over a small label set. To address this, we propose Rigel, an automatic evaluation metric for image and video captioning, based on self-distilled score adaptation. The metric employs an evaluation-specific scoring head distilled from a frozen LLM, which captures judgment signals in a task-aligned space without relying on large-vocabulary token sets. We then refine the LLM backbone with human judgment data. To train Rigel, we constructed the Vid-Lepus dataset, which contains 3,338 video clips, 33,380 reference captions, and 5,637 candidate captions. Experiments on multiple benchmarks show that Rigel outperforms state-of-the-art metrics, achieving over 10-point improvements on ActivityNet-Fact in the reference-free setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Rigel, a new automatic evaluation metric for image and video captioning based on self-distilled score adaptation. It distills an evaluation-specific scoring head from a frozen LLM to operate in a task-aligned space without relying on large-vocabulary token sets, constructs the Vid-Lepus dataset (3,338 video clips, 33,380 reference captions, 5,637 candidate captions) for training with human judgments, refines the LLM backbone, and reports outperforming state-of-the-art metrics with over 10-point improvements on ActivityNet-Fact in the reference-free setting.

Significance. If the results hold with proper validation, this work could advance automatic evaluation in multimodal systems by addressing the known mismatch in LLM-as-a-Judge approaches. The Vid-Lepus dataset is a concrete contribution that supports training and benchmarking of judgment-aligned metrics. The self-distillation strategy for creating a task-aligned scoring head is a clear technical strength.

major comments (1)
  1. Abstract: the central performance claim of >10-point gains on ActivityNet-Fact (reference-free) is load-bearing, yet the provided description contains no experimental section, ablation studies, or statistical significance tests to support it; this prevents verification that the distilled scoring head, rather than dataset-specific fitting, drives the result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the opportunity to clarify the experimental support for our claims. The abstract is a concise summary; the full manuscript provides the requested details.

Point-by-point responses
  1. Referee: Abstract: the central performance claim of >10-point gains on ActivityNet-Fact (reference-free) is load-bearing, yet the provided description contains no experimental section, ablation studies, or statistical significance tests to support it; this prevents verification that the distilled scoring head, rather than dataset-specific fitting, drives the result.

    Authors: We agree that abstracts omit full experimental details by design. The complete manuscript includes Section 4 (Experiments) with benchmark results on ActivityNet-Fact (reference-free) showing the reported gains, Section 4.3 with ablations isolating the self-distilled scoring head, and statistical significance via paired tests in the result tables. Generalization is demonstrated by evaluating on held-out datasets distinct from Vid-Lepus training data, with ablations confirming the scoring head's contribution beyond dataset fitting. revision: no

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The described pipeline distills an evaluation-specific scoring head from a frozen LLM to operate in a task-aligned space, constructs a new Vid-Lepus dataset of video clips and captions, and refines the backbone on human judgments before reporting empirical gains on separate benchmarks such as ActivityNet-Fact. No equations, fitted-input predictions, self-citation chains, or uniqueness theorems are present in the text that would reduce any claimed result to its own inputs by construction. The central performance claims therefore remain independent of the training procedure in the supplied description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, background axioms, or new entities.

pith-pipeline@v0.9.1-grok · 5743 in / 1083 out tokens · 43288 ms · 2026-06-30T06:52:37.004188+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark. In ICLR. David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. 2023. CLAIR: Evaluating Image Captions with Large Language Models. In EMNLP, pages 13638–13646. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang,...

  2. [2]

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. JAIR, 47:853–899. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR. Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hir...

  3. [3]

    In ACCV, pages 3570–3586

    DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning. In ACCV, pages 3570–3586. Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, and Nakamasa Inoue. 2024. HarmonicEval: Multimodal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model. arXiv preprint arXiv:2412.14613. Gabriel Oliveira, Esther Colombini, an...

  4. [4]

    In W-NUT, pages 351–360

    CIDEr-R: Robust Consensus-based Image Description Evaluation. In W-NUT, pages 351–360. Kishore Papineni, Salim Roukos, Todd Ward, and Wei Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, pages 311–318. Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas

  5. [5]

    Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara

    The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2):99–121. Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2023. Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In CVPR, pages 6914–6924. Sara Sarto, Marcella Cornia, Lorenzo Barald...

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805. Tony Cheng Tong, Sirui He, Zhiwen Shao, and Dit-Yan Yeung. 2025. G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. In AAAI, pages 7419–7427. Ramakrishna Vedantam, Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based Image De- ...

  7. [7]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models. In CoNLL, pages 424–435. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 65 others. 2025. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, ...

  8. [8]

    and JaSPICE (Wada et al., 2023), have also been proposed to improve robustness or adapt evaluation to specific settings. Although these metrics remain standard in the literature, prior studies have shown that they often correlate only weakly with human judgments, especially when captions are semantically correct but lexically diverse (Hessel et al., 202...

  9. [9]

    improve this paradigm by adapting CLIP-based scoring to image caption evaluation. Other approaches, such as ViLBERTScore (Lee et al., 2020), UMIC (Lee et al., 2021), Polos (Wada et al., 2024), and DENEB (Matsuda et al., 2024), further … [Figure 6 panels: Qwen2.5-VL-3B, LLaVA-OneVision-1.5-8B, Qwen3-VL-2B, InternVL-3.5-2B; axis in %] Figure 6: The logit distribution over score tokens (“1...

  10. [12]

    Compare the generated caption to video frames

  11. [14]

    Your score is Full prompt for video captioning in the reference-based setting Evaluate the quality of a video caption based on video frames and reference captions

    Assign ONE score from 1 to 5 Generated Caption: cand Please output only a single integer from 1 to 5, without any explanation or formatting. Your score is Full prompt for video captioning in the reference-based setting Evaluate the quality of a video caption based on video frames and reference captions. Evaluation Criteria: - Score ranges from 1 to 5 - 1:...

  12. [15]

    Examine the video frames to understand the main content

  13. [16]

    Assess how accurately the caption describes the video

  14. [17]

    Compare the generated caption to both video frames and references

  15. [19]

    Your score is Full prompt for image captioning in the reference-free setting Evaluate the quality of a image caption based on image

    Assign ONE score from 1 to 5 Reference Captions: refs_text Generated Caption: cand Please output only a single integer from 1 to 5, without any explanation or formatting. Your score is Full prompt for image captioning in the reference-free setting Evaluate the quality of a image caption based on image. Evaluation Criteria: - Score ranges from 1 to 5 - 1: ...

  16. [22]

    Compare the generated caption to image

  17. [24]

    Your score is Full prompt for image captioning in the reference-based setting Evaluate the quality of a image caption based on image and reference captions

    Assign ONE score from 1 to 5 Generated Caption: cand Please output only a single integer from 1 to 5, without any explanation or formatting. Your score is Full prompt for image captioning in the reference-based setting Evaluate the quality of a image caption based on image and reference captions. Evaluation Criteria: - Score ranges from 1 to 5 - 1: Comple...

  18. [25]

    Examine the image to understand the main content

  19. [26]

    Assess how accurately the caption describes the image

  20. [27]

    Compare the generated caption to both image and references

  21. [28]

    Assess coverage of main points and relevance

  22. [29]

    Your score is H Additional Details for ARR Checklist Discuss the License for Artifacts. RIGEL and Vid-Lepus are released under the BSD 3-Clause Clear License

    Assign ONE score from 1 to 5 Reference Captions: refs_text Generated Caption: cand Please output only a single integer from 1 to 5, without any explanation or formatting. Your score is H Additional Details for ARR Checklist Discuss the License for Artifacts. RIGEL and Vid-Lepus are released under the BSD 3-Clause Clear License. The licenses of the models an...