Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

Jessie Salas; Qinwu Xu; Zhuoheng Li

REVIEW · 2 major objections · 2 minor · cited by 2

Checkpoint selection for multimodal LLMs improves when it is treated as a decision problem under evaluation uncertainty, using staged ranking and subsampling.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-20 20:38 UTC pith:LPGBYPFP

Load-bearing objection: The paper lays out a multi-stage LLM-judge pipeline for MLLM checkpoint selection with subsampling and OCR focus, but offers no results to show it actually delivers reliable uncertainty estimates.

arXiv 2605.18852 v1 · pith:LPGBYPFP · submitted 2026-05-13 · cs.LG cs.AI cs.CL


classification: cs.LG cs.AI cs.CL

keywords: checkpoint selection; multimodal LLMs; robust evaluation; LLM-based judgment; uncertainty estimation; OCR readability; ranking protocols; model selection

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem of picking among multimodal large language model checkpoints when performance differences are small and evaluation signals contain noise, particularly in tasks involving OCR. It reframes the task as a robust decision problem instead of relying on static benchmarks or single scores that often fail to match real-world conditions. The proposed multi-stage framework combines curated real-world data, structured judgments from an LLM judge, pointwise filtering, listwise ranking, and pairwise comparisons, while adding subsampling for confidence estimates and percentile scoring to account for distributional behavior and avoid tail failures. A reader would care because unreliable checkpoint choices waste compute during training and produce models that perform inconsistently outside controlled tests. The work also highlights that OCR readability of the evaluation data itself strongly affects how trustworthy the entire process is.

Core claim

Checkpoint selection for multimodal large language models is formulated as a robust decision problem under evaluation uncertainty. The solution is a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and progressive ranking protocols consisting of pointwise filtering, listwise ranking, and pairwise comparison. Subsampling-based confidence estimation and a percentile-based scoring method capture distributional properties and penalize tail failures, while data quality measured by OCR readability is shown to be a critical factor for evaluation validity.

What carries the argument

The multi-stage evaluation framework that performs progressive refinement through pointwise filtering, listwise ranking, and pairwise comparison, supported by subsampling-based confidence estimation and percentile-based scoring.
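As an illustrative sketch only (the paper's prompts, thresholds, and judge interface are not given on this page), the three stages can be pictured as a funnel over per-sample judge scores. The function `staged_select`, the `floor` and `keep` parameters, and the checkpoint names are hypothetical, and plain numeric scores stand in for the listwise and pairwise verdicts an LLM judge would produce.

```python
from itertools import combinations
from statistics import mean

def staged_select(scores, floor=0.5, keep=3):
    """scores: {checkpoint: [per-sample judge scores in 0..1]} (made-up data)."""
    # Stage 1, pointwise filtering: drop checkpoints whose mean score is below the floor.
    survivors = {c: s for c, s in scores.items() if mean(s) >= floor}
    # Stage 2, listwise ranking: order survivors by mean score and keep the top few.
    shortlist = sorted(survivors, key=lambda c: mean(survivors[c]), reverse=True)[:keep]
    # Stage 3, pairwise comparison: a checkpoint wins a pairing if it is
    # better on more samples than its opponent.
    wins = {c: 0 for c in shortlist}
    for a, b in combinations(shortlist, 2):
        a_better = sum(x > y for x, y in zip(scores[a], scores[b]))
        b_better = sum(y > x for x, y in zip(scores[a], scores[b]))
        if a_better > b_better:
            wins[a] += 1
        elif b_better > a_better:
            wins[b] += 1
    # Stable sort: ties in pairwise wins keep the listwise order from stage 2.
    return sorted(shortlist, key=lambda c: wins[c], reverse=True)

scores = {
    "ckpt_a": [0.9, 0.8, 0.7, 0.9],
    "ckpt_b": [0.8, 0.9, 0.6, 0.8],
    "ckpt_c": [0.2, 0.3, 0.4, 0.1],
    "ckpt_d": [0.7, 0.7, 0.8, 0.7],
}
ranking = staged_select(scores)
print(ranking)
```

Each stage is more expensive per candidate than the one before it, which is the usual reason to run the cheap pointwise filter first and reserve pairwise comparison for a short list.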

Load-bearing premise

Structured LLM-based judgments combined with subsampling yield reliable uncertainty estimates without the judge model introducing its own systematic biases, and OCR readability serves as a key driver of evaluation validity.

What would settle it

Run the method and baseline selection procedures on the same set of checkpoints, then measure which selected models actually show lower performance variance and higher correlation with human judgments on a fresh collection of real-world tasks with varying OCR quality.
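One way to make this test concrete, sketched under assumptions: compute the Spearman rank correlation between judge-derived checkpoint scores and human ratings, and the variance of a selected checkpoint's scores across task slices. The helper names and all numbers below are made up for illustration; the paper does not specify either metric.

```python
from statistics import mean, pvariance

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Pearson correlation of the ranks. Undefined if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-checkpoint scores from the judge and from human raters.
judge = [0.62, 0.71, 0.55, 0.80]
human = [3.1, 3.9, 2.8, 4.4]
rho = spearman(judge, human)

# Hypothetical scores of one selected checkpoint on slices of varying OCR quality.
slice_scores = [0.78, 0.81, 0.74]
spread = pvariance(slice_scores)
print(rho, spread)
```

A selection procedure would be favored by this test if its chosen checkpoints show a higher `rho` against human ratings and a lower `spread` than the baseline's choices on the same fresh tasks.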


If this is right

Selections align more closely with in-the-wild usage instead of static benchmark scores.
Subsampling confidence estimates allow meaningful distinctions even when performance margins are small.
Attention to OCR readability in the evaluation data improves the overall trustworthiness of the ranking process.
The staged protocol reduces the chance that a single noisy run determines the final checkpoint choice.
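The point about small margins can be illustrated with a minimal subsampling sketch. It assumes paired per-sample judge scores for two checkpoints and repeated subsampling without replacement; the function name `win_probability`, the `frac` and `n_rounds` parameters, and the scores are hypothetical, since the page does not state the paper's exact scheme.

```python
import random
from statistics import mean

def win_probability(scores_a, scores_b, n_rounds=1000, frac=0.5, seed=0):
    """Fraction of random subsamples in which A's mean score beats B's.

    scores_a and scores_b must be paired: index i is the same evaluation sample.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(scores_a)
    k = max(1, int(n * frac))
    wins = 0
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)  # subsample without replacement
        if mean(scores_a[i] for i in idx) > mean(scores_b[i] for i in idx):
            wins += 1
    return wins / n_rounds

a = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8]
b = [0.8, 0.9, 0.6, 0.8, 0.7, 0.7]
p = win_probability(a, b)
print(p)
```

A value near 0.5 suggests the two checkpoints cannot be separated on this data, while a value near 0 or 1 suggests the ordering is stable under resampling.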

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same uncertainty-aware ranking approach could apply to selecting models in other noisy domains such as code generation or long-context reasoning.
Treating evaluation as an agentic, multi-stage process may reduce wasted training cycles by enabling earlier and more stable decisions.
Percentile scoring that penalizes tail failures might generalize to other model-selection settings where extreme poor runs matter more than average scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper formulates checkpoint selection for multimodal LLMs as a robust decision problem under evaluation uncertainty. It proposes a multi-stage framework integrating curated real-world data, structured LLM-based judgment, and ranking protocols (pointwise filtering, listwise ranking, pairwise comparison), augmented by subsampling-based confidence estimation and percentile-based scoring that captures distributional characteristics while penalizing tail failures. It further claims that OCR readability is a critical determinant of evaluation validity.

Significance. If empirically validated, the framework could improve robustness in MLLM checkpoint selection for noisy, real-world OCR-heavy tasks by moving beyond static benchmarks. The multi-stage agentic evaluation and stability-aware ranking introduce potentially useful protocols for uncertainty handling. However, the absence of any reported results, baselines, error bars, or ablation studies in the manuscript limits assessment of whether these components deliver measurable gains.

major comments (2)

[Abstract] The central claim that structured LLM-based judgment combined with subsampling yields reliable uncertainty estimates and that OCR readability is a critical validity factor rests on untested assumptions about judge-model biases; no experiments, comparisons to human judgments, or bias analyses are provided to support this load-bearing premise.
[Abstract] The multi-stage pipeline (pointwise filtering, listwise ranking, pairwise comparison) and percentile-based scoring formulation are presented without any quantitative validation, cross-validation against existing methods, or sensitivity analysis to judge inconsistencies, undermining the robust decision formulation.

minor comments (2)

Clarify the precise mathematical definition of the percentile-based scoring and how subsampling confidence intervals are computed and integrated into the final ranking.
Provide details on the curated real-world dataset, including size, diversity, and how OCR readability was quantified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. We agree that the current manuscript would benefit from empirical validation to substantiate the framework's claims, and we will incorporate the suggested experiments and analyses in the revision.


Referee: [Abstract] The central claim that structured LLM-based judgment combined with subsampling yields reliable uncertainty estimates and that OCR readability is a critical validity factor rests on untested assumptions about judge-model biases; no experiments, comparisons to human judgments, or bias analyses are provided to support this load-bearing premise.

Authors: We acknowledge that the submitted manuscript presents the framework and its motivation without accompanying empirical results to validate the central claims. This is a fair observation. In the revised version we will add a dedicated experiments section that includes: direct comparisons of structured LLM judgments against human annotations on curated real-world samples; bias analyses across multiple judge models; and validation of the subsampling-based confidence estimates against observed performance variability. These additions will provide quantitative support for the reliability of the uncertainty estimates and the role of OCR readability. revision: yes

Referee: [Abstract] The multi-stage pipeline (pointwise filtering, listwise ranking, pairwise comparison) and percentile-based scoring formulation are presented without any quantitative validation, cross-validation against existing methods, or sensitivity analysis to judge inconsistencies, undermining the robust decision formulation.

Authors: We agree that the absence of quantitative validation limits the ability to assess the practical gains of the proposed pipeline and scoring method. We will revise the manuscript to include: end-to-end results on real-world multimodal checkpoint selection tasks; comparisons against standard pointwise and static benchmark baselines; ablation studies isolating each stage of the pipeline; and sensitivity analyses measuring the effect of judge-model inconsistencies. Error bars from repeated subsampling runs will also be reported to demonstrate robustness. revision: yes
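A sensitivity analysis of judge inconsistency could start from something as simple as a position-swap check on the pairwise stage. The sketch below is an assumption about what such a check might look like, not the authors' planned experiment: `judge` is a hypothetical callable that sees two responses and returns `"first"` or `"second"`, and the toy judges stand in for a real LLM call.

```python
def flip_rate(judge, pairs):
    """Share of pairs whose preferred response changes when the order is swapped.

    judge(first, second) returns "first" or "second" (hypothetical interface).
    """
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)   # a is shown first
        backward = judge(b, a)  # b is shown first
        # A consistent judge names the same response both times, so the
        # position label must change ("first" then "second", or the reverse).
        # Identical labels mean the verdict followed position, not content.
        if forward == backward:
            flips += 1
    return flips / len(pairs)

# Toy judges for illustration only.
def position_biased(first, second):
    return "first"

def prefers_longer(first, second):
    return "first" if len(first) >= len(second) else "second"

pairs = [("short answer", "a much longer answer"), ("abc", "abcdef")]
print(flip_rate(position_biased, pairs), flip_rate(prefers_longer, pairs))
```

A high flip rate would indicate that pairwise verdicts depend on presentation order, which is one of the judge biases the referee asks the authors to quantify.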

Circularity Check

0 steps flagged

No circularity detected; framework proposal is self-contained


The paper formulates checkpoint selection as a robust decision problem and proposes a multi-stage framework using curated real-world data, structured LLM-based judgment, pointwise filtering, listwise ranking, pairwise comparison, subsampling-based confidence estimation, and percentile-based scoring. No equations, derivations, or fitted parameters are shown that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claims introduce new protocols and demonstrate the importance of OCR readability without redefining quantities circularly or relying on self-referential definitions. The derivation chain is independent and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central proposal rests on the reliability of LLM judgments as structured evaluators and the representativeness of curated real-world data; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption Structured LLM-based judgment provides unbiased and reliable evaluation signals for multimodal outputs
Invoked to support pointwise filtering, listwise ranking, and pairwise comparison stages.
domain assumption OCR readability is a critical determinant of evaluation validity
Stated as a demonstrated critical factor in the abstract.

pith-pipeline@v0.9.0 · 5681 in / 1374 out tokens · 70034 ms · 2026-05-20T20:38:24.696445+00:00 · methodology


Original abstract

Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.

Figures

Figures reproduced from arXiv: 2605.18852 by Jessie Salas, Qinwu Xu, Zhuoheng Li.

**Figure 2.** Overview of the proposed evaluation framework for robust checkpoint selection through ...

**Figure 3.** Images with ambiguous OCR text and differing model responses to questions.

**Figure 4.** VQA pairs with readable OCR text and different model answers (photos, all AI generated ...

**Figure 5.** 1) structured table understanding and semantic extraction under low-visibility visual ...

**Figure 6.** 1) Task alignment vs. surface fidelity in information extraction (photo AI generated), 2) ...

**Figure 7.** Knowledge grounding in high-visibility settings (Image source: Screenshot from Google ...


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

percentile-based scoring formulation: S = P50 − β(P50 − P20) + γ(P80 − P50)
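To see how this formula behaves, here is a minimal sketch. The values of β and γ and the percentile interpolation rule are assumptions (neither is specified on this page), and the name `stability_score` is hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0..100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def stability_score(values, beta=1.0, gamma=0.5):
    """S = P50 - beta * (P50 - P20) + gamma * (P80 - P50)."""
    p20, p50, p80 = (percentile(values, q) for q in (20, 50, 80))
    return p50 - beta * (p50 - p20) + gamma * (p80 - p50)

# Two made-up score distributions with the same median of 5.
spread_out = list(range(11))   # 0, 1, ..., 10
tight = [5] * 11               # no tail in either direction

s_spread = stability_score(spread_out)
s_tight = stability_score(tight)
print(s_spread, s_tight)
```

With β larger than γ, a wide lower tail costs more than an equally wide upper tail earns, so the spread-out distribution scores below the tight one even though both share the same median. That asymmetry is how the formula penalizes tail failures.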

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning and generation with Vision-Language Models
cs.LG 2026-05 unverdicted novelty 5.0

MLLMs can infer Miller indices from idealized fracture images and correctly reject the representation when the fracture geometry does not support a planar crystallographic interpretation.

Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning and generation with Vision-Language Models
cs.LG 2026-05 unverdicted novelty 4.0

MLLMs can infer Miller indices for idealized fracture planes and reject the representation when physics does not support it, based on experiments across synthetic and real fracture images.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Apicella, A., Isgrò, F., Pollastro, A., & Prevete, R. (2026). Don’t stop me now: Rethinking validation criteria for model parameter selection. arXiv preprint arXiv:2602.22107. Chang, et al. (2025). WearVQA: A visual question answering benchmark for wearables in egocentric authentic real-world scenarios. In Advances in Neural Information Processing Systems ...

[2] Liu, Y., et al. (2023). On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895. Liu, Y., et al. (2024). A survey on benchmarks of multimodal large language models. arXiv preprint arXiv:2408.08632. Miller, J. K., et al. (2025). Evaluating LLM metrics through real-world capabilities. arXiv preprint arXiv:2505.08253. Prechelt...
