pith. machine review for the scientific record.

arxiv: 2604.25855 · v2 · submitted 2026-04-28 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:10 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: selective prediction · visual question answering · out-of-distribution · multimodal large language models · visual evidence · generalization · coverage

The pith

SIEVES improves selective prediction coverage up to threefold on out-of-distribution VQA by scoring visual evidence quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that selective prediction in multimodal visual question answering can generalize better to out-of-distribution cases by using a selector that scores the quality of localized visual evidence produced by the reasoner. This replaces reliance on unavailable internal signals like logits with information from inputs and outputs only. If true, it would allow higher coverage at fixed risk levels on real-world benchmarks and enable application to proprietary models such as o3 and Gemini-3-Pro. The gains are shown to hold without any benchmark- or model-specific training.

Core claim

SIEVES shows that selective prediction generalizes through visual evidence scoring. The method requires reasoner models to output localized visual evidence alongside answers, then uses a selector trained on inputs and outputs to estimate the quality of that evidence. This yields up to three times better coverage on OOD benchmarks such as V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA compared to non-grounding baselines, and transfers to proprietary reasoners without access to weights or logits.
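
The coverage-at-risk framing behind this claim is standard selective prediction: rank answers by a confidence score, answer only the most confident ones, and abstain on the rest so that the empirical error rate among answered inputs stays at or below a user-chosen risk level. A minimal sketch of that computation follows; the array names and the synthetic data are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def coverage_at_risk(scores: np.ndarray, correct: np.ndarray, target_risk: float) -> float:
    """Largest fraction of inputs that can be answered while the empirical
    error rate on the answered subset stays at or below target_risk.

    scores  : selector confidence per example (higher = answer, lower = abstain)
    correct : 1 if the reasoner's answer was correct, 0 otherwise
    """
    order = np.argsort(-scores)                 # most confident first
    errors = np.cumsum(1 - correct[order])      # errors among the k most confident answers
    k = np.arange(1, len(scores) + 1)
    risk = errors / k                           # empirical risk at each coverage prefix
    feasible = np.flatnonzero(risk <= target_risk)
    return 0.0 if feasible.size == 0 else (feasible[-1] + 1) / len(scores)

# Toy usage with synthetic data (illustrative only, not the paper's numbers).
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=1000)
scores = correct + rng.normal(0.0, 0.7, size=1000)  # informative but noisy confidence
print(f"coverage at 10% risk: {coverage_at_risk(scores, correct, 0.10):.2f}")
```

A better selector pushes errors toward the low-confidence tail, so the feasible prefix grows and coverage rises at the same risk level; this is the quantity on which the "up to three times" comparison is made.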

What carries the argument

The SIEVES selector, which learns to estimate the quality of localized visual evidence from model inputs and outputs alone.

If this is right

  • Selective predictors can now operate on closed-source MLLMs without logits or hidden states.
  • Coverage improvements hold across multiple OOD VQA benchmarks without benchmark-specific training.
  • The approach provides gains beyond what accuracy improvements alone would deliver.
  • The method generalizes across reasoner models such as Pixel-Reasoner, o3, and Gemini-3-Pro.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit visual grounding may offer a pathway to uncertainty estimation that scales to future multimodal systems.
  • This could inspire similar evidence-based selectors for other tasks like image captioning or visual reasoning chains.
  • If localization quality correlates with correctness, it suggests training objectives that reward explicit evidence production might improve overall reliability.

Load-bearing premise

The quality of localized visual evidence produced by the reasoner serves as a reliable proxy for answer correctness, allowing a selector trained only on inputs and outputs to accurately estimate it.

What would settle it

If applying the SIEVES selector to a reasoner that produces poor or unrelated visual evidence fails to improve coverage over baselines, or if correct answers are consistently paired with low-quality evidence.
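
A concrete way to run that check, sketched below under assumptions not taken from the paper: if each answer ships with a predicted evidence region and an annotated ground-truth region is available, the overlap between them should separate correct from incorrect answers. An AUROC near 0.5 would correspond to the failure mode above; a value well above 0.5 would support the premise. The names `pred_boxes`, `gt_boxes`, and `correct` are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evidence_auroc(pred_boxes, gt_boxes, correct):
    """Rank-based AUROC of evidence IoU as a predictor of answer correctness:
    the probability that a correct answer's evidence overlaps its ground-truth
    region more than an incorrect answer's does (ties count as 0.5)."""
    ious = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    correct = np.asarray(correct)
    pos, neg = ious[correct == 1], ious[correct == 0]
    if pos.size == 0 or neg.size == 0:
        return float("nan")
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```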

Figures

Figures reproduced from arXiv: 2604.25855 by Hector G. Rodriguez, Marcus Rohrbach.

Figure 1: Selective prediction through Visual Evidence Scoring.
Figure 2: A selective prediction framework must output an answer and a confidence score to a visual question. Top: Standard implicit confidence estimation performs language-only reasoning before answering. The selector model is trained to output a scalar confidence score, given the entire conversation {Question, Image, Reasoning, Answer}. Without evidence, the selector must solve the visual task entirely again or …
Figure 3: OOD coverage at varying risk levels for frontier proprietary reason…
Figure 4: Qualitative examples in high-stakes settings where SIEVES correctly …
Figure 5: Additional qualitative examples where SIEVES correctly accepts or …
Figure 6: Thyme multiple-choice distractor generation prompt.
Figure 7: Prompt for reasoner with localization. We prompt the reasoner with the tool definition and the guidelines presented here, which encourage it to use the zoom-in tool to provide visual evidence for its answer.
Figure 8: Open-ended judge prompt. For open-ended questions, we use this prompt when normalized exact match fails to decide whether the predicted answer should still count as correct. This makes correctness labels more robust to valid rewordings and semantically equivalent answers that exact match would reject. We prompt Qwen3-8B with a single user message at temperature 0. If multiple ground-truth answers are avai…
Figure 9: Crop-answer grounding coherence labeling prompt for selector train…
Figure 10: Localization annotation prompt for SIEVES training.
Original abstract

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model internal signals like logits or hidden representations, which are not available for frontier closed-sourced models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SIEVES, a selective prediction method for multimodal LLMs on VQA tasks. Reasoner models are prompted to produce localized visual evidence alongside answers; a selector is then trained solely on inputs and outputs to score evidence quality and decide abstention, aiming to raise coverage at a fixed user-specified risk. Experiments report up to 3x coverage gains over non-grounding baselines on five OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, AdVQA) across open and closed reasoners (Pixel-Reasoner, o3, Gemini-3-Pro) without benchmark- or model-specific retraining, with code released.

Significance. If the central results hold, the work offers a practical route to selective prediction for black-box frontier models that lack logit or hidden-state access, addressing a key barrier to reliable OOD deployment. The emphasis on explicit visual evidence and the public code release are notable strengths that support reproducibility and extension.

major comments (3)
  1. [Experiments] Experiments section: The reported coverage gains of up to 3x lack any mention of statistical significance testing, confidence intervals, or variance estimates across the five OOD benchmarks and three reasoner families; without these, it is unclear whether the improvements exceed what could arise from accuracy differences or implementation variance in the baselines.
  2. [Method] Method section (selector description): The claim that the selector estimates localization quality from inputs/outputs alone requires explicit details on the feature set, training objective, and data distribution used for the selector; absent these, it is difficult to evaluate whether the selector recovers the assumed correlation between evidence quality and correctness or instead exploits spurious surface patterns.
  3. [Experiments] Experiments section: Potential confounding between accuracy improvements and selective-prediction gains is not addressed via ablations (e.g., accuracy-threshold baselines or oracle evidence-quality selectors); this leaves open whether the coverage lift is attributable to the evidence-scoring mechanism or to other factors such as prompting changes.
minor comments (2)
  1. [Abstract] Abstract: The exact risk threshold(s) at which the 3x coverage figures are measured should be stated explicitly for reproducibility.
  2. [Figures] Figure captions: Ensure all plots of coverage vs. risk include clear axis labels, legend entries for each baseline, and the precise risk value used for the headline comparisons.
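
For reference on minor comment 2, a risk-coverage curve of the kind the coverage-vs-risk figures appear to show can be drawn by sweeping the confidence threshold. The sketch below uses synthetic scores and matplotlib and is not the paper's plotting code; labeled axes, a legend, and a marked risk level are the items the comment asks for.

```python
import numpy as np
import matplotlib.pyplot as plt

def risk_coverage_curve(scores, correct):
    """Empirical risk at every coverage level, obtained by sweeping the threshold."""
    order = np.argsort(-np.asarray(scores))
    correct_sorted = np.asarray(correct)[order]
    k = np.arange(1, len(correct_sorted) + 1)
    coverage = k / len(correct_sorted)
    risk = np.cumsum(1 - correct_sorted) / k
    return risk, coverage

# Hypothetical selectors to compare; real curves would come from model outputs.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=2000)
selectors = {
    "grounding-based selector": correct + rng.normal(0.0, 0.6, 2000),
    "non-grounding baseline": correct + rng.normal(0.0, 1.2, 2000),
}
for name, scores in selectors.items():
    risk, coverage = risk_coverage_curve(scores, correct)
    plt.plot(risk, coverage, label=name)
plt.axvline(0.10, linestyle="--", color="grey", label="example risk level (10%)")
plt.xlabel("risk (error rate on answered inputs)")
plt.ylabel("coverage (share of inputs answered)")
plt.legend()
plt.tight_layout()
plt.show()
```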

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we have updated the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported coverage gains of up to 3x lack any mention of statistical significance testing, confidence intervals, or variance estimates across the five OOD benchmarks and three reasoner families; without these, it is unclear whether the improvements exceed what could arise from accuracy differences or implementation variance in the baselines.

    Authors: We agree that including statistical significance measures would enhance the robustness of our claims. In the revised version, we will add bootstrap confidence intervals (with 1000 resamples) for the coverage improvements on each benchmark and across reasoner models. Additionally, we will report standard deviations from multiple training seeds for the selector where applicable. These additions will demonstrate that the observed gains are statistically significant and not attributable to variance. revision: yes

  2. Referee: [Method] Method section (selector description): The claim that the selector estimates localization quality from inputs/outputs alone requires explicit details on the feature set, training objective, and data distribution used for the selector; absent these, it is difficult to evaluate whether the selector recovers the assumed correlation between evidence quality and correctness or instead exploits spurious surface patterns.

    Authors: We appreciate this request for clarification. The revised method section will explicitly describe the selector's feature set, which consists of the input question, the generated answer, and the textual description of the localized visual evidence. The training objective is a binary classification loss to predict whether the evidence is of high quality (correlating with answer correctness), trained on a held-out set of in-distribution VQA examples with automatically generated labels based on evidence overlap with ground-truth regions. The data distribution is detailed as coming from standard VQA datasets like VQAv2, ensuring no overlap with the OOD test benchmarks. This setup ensures the selector learns genuine evidence quality rather than surface patterns. revision: yes

  3. Referee: [Experiments] Experiments section: Potential confounding between accuracy improvements and selective-prediction gains is not addressed via ablations (e.g., accuracy-threshold baselines or oracle evidence-quality selectors); this leaves open whether the coverage lift is attributable to the evidence-scoring mechanism or to other factors such as prompting changes.

    Authors: We acknowledge the potential for confounding factors. In the updated experiments section, we will include additional ablations: (1) an accuracy-threshold baseline using the reasoner model's self-reported confidence scores, and (2) an oracle selector with access to ground-truth evidence quality labels. These will isolate the contribution of our evidence-scoring approach. We also clarify that all methods, including baselines, use the same prompting strategy for the reasoner, with the only difference being the presence of the selector for abstention decisions. revision: yes
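
Rebuttal points 1 and 3 promise bootstrap confidence intervals and oracle / self-confidence ablations for the coverage-at-fixed-risk metric. A minimal sketch of how both could be computed; the coverage helper repeats the computation sketched under the core claim, and every function and variable name here is a hypothetical placeholder rather than part of the released code.

```python
import numpy as np

def coverage_at_risk(scores, correct, target_risk):
    """Max coverage whose answered subset keeps empirical error <= target_risk."""
    order = np.argsort(-np.asarray(scores))
    errors = np.cumsum(1 - np.asarray(correct)[order])
    risk = errors / np.arange(1, len(correct) + 1)
    feasible = np.flatnonzero(risk <= target_risk)
    return 0.0 if feasible.size == 0 else (feasible[-1] + 1) / len(correct)

def bootstrap_coverage_gain(scores_new, scores_base, correct, target_risk=0.1,
                            n_boot=1000, seed=0):
    """Percentile bootstrap CI (1000 resamples, as proposed in the rebuttal) for
    the coverage difference between two selectors at the same risk level."""
    scores_new, scores_base, correct = map(np.asarray, (scores_new, scores_base, correct))
    rng = np.random.default_rng(seed)
    n = len(correct)
    gains = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample question-answer pairs with replacement
        gains[b] = (coverage_at_risk(scores_new[idx], correct[idx], target_risk)
                    - coverage_at_risk(scores_base[idx], correct[idx], target_risk))
    lo, hi = np.percentile(gains, [2.5, 97.5])
    return float(gains.mean()), (float(lo), float(hi))

def ablation_row(correct, target_risk, **score_sources):
    """Coverage at one risk level for each confidence source, e.g. the learned
    selector, the reasoner's self-reported confidence, and an oracle scored with
    ground-truth evidence-quality labels."""
    return {name: coverage_at_risk(scores, correct, target_risk)
            for name, scores in score_sources.items()}
```

Calling `ablation_row(correct, 0.1, sieves=selector_scores, self_confidence=verbalized_scores, oracle=gt_quality)` would produce one row of the promised ablation comparison at a 10% risk level.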
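
Rebuttal point 2 describes the selector as a binary classifier over the question, the answer, and the localized evidence, with targets derived from how well the evidence overlaps a ground-truth region. A minimal sketch of that dataset construction under the stated assumptions; the field names, the dict schema, and the 0.5 binarization threshold are placeholders, not the paper's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class SelectorExample:
    question: str
    answer: str
    evidence: str   # e.g. a crop path or textual description of the zoomed-in region
    label: int      # 1 = evidence judged high quality, 0 = not

def binarize_overlap(overlap_with_gt: float, threshold: float = 0.5) -> int:
    """Turn a continuous overlap score between the predicted evidence region and
    the annotated ground-truth region into the binary selector target.
    The threshold value is a placeholder for illustration."""
    return int(overlap_with_gt >= threshold)

def build_selector_dataset(records, threshold=0.5):
    """records: iterable of dicts with keys question/answer/evidence/overlap_with_gt,
    drawn from in-distribution VQA data so that no OOD test benchmark is touched."""
    return [
        SelectorExample(
            question=r["question"],
            answer=r["answer"],
            evidence=r["evidence"],
            label=binarize_overlap(r["overlap_with_gt"], threshold),
        )
        for r in records
    ]
```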

Circularity Check

0 steps flagged

No circularity: empirical selector training on inputs/outputs yields experimental coverage gains without reducing to self-referential definitions or fitted inputs by construction.

full rationale

The paper's central contribution is an empirical method: reasoners produce localized visual evidence, a selector is trained on observable inputs and outputs to score evidence quality, and coverage improvements are measured on OOD benchmarks. These results are experimental outcomes from standard supervised training and evaluation, not mathematical derivations or predictions that equate to their own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps for the core claim, and the selector's operation does not tautologically presuppose the reported gains. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that visual grounding quality correlates with answer reliability and on standard supervised learning assumptions for the selector model.

free parameters (1)
  • risk threshold
    User-specified risk level that determines the coverage-risk operating point; the exact values appear to be chosen per benchmark.
axioms (1)
  • domain assumption: Visual grounding quality is a reliable proxy for answer correctness in VQA
    Invoked to justify using evidence scoring as the confidence signal.

pith-pipeline@v0.9.0 · 5628 in / 1244 out tokens · 68047 ms · 2026-05-15T06:10:53.252878+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
