arxiv: 2603.19790 · v3 · submitted 2026-03-20 · 💻 cs.CV

Recognition: no theorem link

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Weile Gong , Yiping Zuo , Zijian Lu , Xin He , Weibei Fan , Lianyong Qi , Shi Jin

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords generative OCRvision-language modelsrisk controlgeometric verificationselective acceptancefrozen VLMsOCR deploymenterror reduction

0 comments

The pith

Generative OCR with frozen vision-language models requires explicit risk control to prioritize visual verifiability over semantic plausibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models can transcribe text from images but their autoregressive decoding favors outputs that seem semantically likely rather than those strictly supported by the visual geometry. This produces rare yet severe failures such as over-generation and substitutions that lack pixel-level backing, even when average accuracy metrics appear strong. The paper recasts the problem as a selective accept-or-abstain decision and solves it with a model-agnostic Geometric Risk Controller. The controller generates multiple structured views of the input, applies lightweight structural checks, and accepts a candidate transcription only when cross-view consensus and stability meet preset thresholds. Experiments across standard OCR benchmarks and frozen backbones confirm lower rates of extreme errors and catastrophic over-generation, at the predictable expense of reduced coverage.

Core claim

Treating frozen-VLM OCR as a selective accept/abstain problem solved by a Geometric Risk Controller that probes multiple structured views, applies structural screening, and accepts transcriptions only on cross-view consensus and stability yields consistent reductions in extreme-error risk and over-generation at controlled coverage costs.

What carries the argument

The Geometric Risk Controller, which probes multiple structured views of the input image, performs lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria.

Load-bearing premise

Multiple structured views of the same input supply sufficiently independent signals to detect and filter transcriptions that lack visual support.

What would settle it

A benchmark run in which the controller accepts many transcriptions that pixel-level geometric verification or human inspection later shows to contain unsupported text or over-generation.

Figures

Figures reproduced from arXiv: 2603.19790 by Lianyong Qi, Shi Jin, Weibei Fan, Weile Gong, Xin He, Yiping Zuo, Zijian Lu.

**Figure 1.** Figure 1: Layered system view of the proposed geometric risk controller (GRC). The pipeline consists of an input layer, a [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Representative accept/abstain evidence patterns [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative OCR cases under the fixed deployment protocol. Each panel shows the crop, ground truth, the always-accept [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Risk–coverage trajectories under the common fixed protocol for all three backbones on IIIT5K and ICDAR13. Each [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Component effects and comparison to an external [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The Geometric Risk Controller adds a selective accept/abstain layer to VLM OCR that cuts extreme errors via multi-view consensus, but the benefit depends on an untested assumption that the views supply independent signals.

read the letter

The main thing to know is that the paper introduces a model-agnostic Geometric Risk Controller that only accepts VLM-generated OCR transcriptions when multiple structured views reach consensus on both content and geometric stability. This targets the gap between what autoregressive decoding produces and what verifiable OCR needs. It does well by keeping the VLM frozen and adding a lightweight screening step that operates at the system level. The reported results show lower rates of over-generation and unsupported substitutions on standard benchmarks, which is the kind of outcome that matters for real deployment where average accuracy can hide rare but bad failures. The approach is straightforward to describe and seems easy to reproduce on top of existing models. The soft spot is the independence of the views. All structured views come from one input image, so they can share the same visual weaknesses such as low contrast regions or ambiguous shapes. In that case the consensus check could pass on incorrect transcriptions that happen to agree across the flawed views. The abstract does not include checks or derivations that rule this out, so the risk-control benefit rests partly on an assumption that needs more testing. The consensus and stability thresholds are also tunable parameters whose sensitivity is not fully explored. This paper is for teams building production OCR pipelines with VLMs who want explicit risk controls rather than hoping the model behaves. A reader focused on practical safeguards would find the method worth trying, even if they have to validate the view independence themselves. I would send it to peer review. The idea is sound enough and the problem is important, but the referee process can push for the missing robustness analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that generative OCR using frozen vision-language models suffers from a core misalignment: autoregressive decoding prioritizes semantic plausibility over visual grounding and geometric verifiability, producing severe errors such as over-generation and unsupported substitutions. To address this, the authors formulate OCR as a selective accept/abstain problem and introduce a model-agnostic Geometric Risk Controller. The controller generates multiple structured views of the input image, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability meet predefined criteria, producing a family of operating points with reduced extreme-error risk at predictable coverage costs. Experiments on standard OCR benchmarks with frozen VLM backbones are reported to show consistent risk reductions.

Significance. If the risk reductions hold, the work offers a practical system-level safeguard for deploying frozen VLMs in OCR tasks where verifiability matters more than unconstrained generation. The model-agnostic framing and emphasis on explicit risk control rather than base-model retraining could influence deployment practices in document analysis and related applications, provided the controller generalizes beyond the tested benchmarks.

major comments (2)

[Proposed Method (Geometric Risk Controller)] The central claim depends on the assumption that multiple structured views of the same image yield sufficiently independent signals for the consensus-and-stability screening to detect visually unsupported transcriptions. Because all views derive from a single input, shared visual ambiguities, rendering artifacts, or low-contrast regions could produce correlated failures and spurious consensus. No derivation, robustness analysis, or empirical check against such correlations appears in the described method or experiments.
[Experiments] The abstract states that experiments show 'consistent reductions in extreme-error risk' on standard benchmarks, yet no specific metrics (e.g., error-rate deltas, coverage percentages), ablation results on view count or threshold sensitivity, or details on how the consensus/stability thresholds are set are provided. This absence leaves the magnitude of the claimed benefit and the reproducibility of the operating points difficult to assess.

minor comments (1)

[Abstract] The abstract introduces the 'Geometric Risk Controller' without a concise inline definition or high-level diagram reference; adding one sentence clarifying its inputs and decision rule would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

read point-by-point responses

Referee: [Proposed Method (Geometric Risk Controller)] The central claim depends on the assumption that multiple structured views of the same image yield sufficiently independent signals for the consensus-and-stability screening to detect visually unsupported transcriptions. Because all views derive from a single input, shared visual ambiguities, rendering artifacts, or low-contrast regions could produce correlated failures and spurious consensus. No derivation, robustness analysis, or empirical check against such correlations appears in the described method or experiments.

Authors: We agree that potential correlations between views represent an important consideration. The structured views are generated through diverse geometric transformations (e.g., affine warps, multi-scale crops, and contrast adjustments) specifically chosen to disrupt local visual correlations while maintaining the underlying text geometry. The stability metric further mitigates spurious consensus by requiring low variance in transcription outputs across views. Although a formal derivation of statistical independence is not included, we provide empirical evidence through failure case analysis in the supplementary material showing that the controller rejects cases with high inter-view disagreement. In the revised manuscript, we will add a dedicated subsection with correlation analysis (e.g., pairwise agreement rates) and additional experiments on challenging low-contrast images. revision: yes
Referee: [Experiments] The abstract states that experiments show 'consistent reductions in extreme-error risk' on standard benchmarks, yet no specific metrics (e.g., error-rate deltas, coverage percentages), ablation results on view count or threshold sensitivity, or details on how the consensus/stability thresholds are set are provided. This absence leaves the magnitude of the claimed benefit and the reproducibility of the operating points difficult to assess.

Authors: The full paper includes these details in Section 4 and the associated tables: for example, on the IAM dataset with a frozen LLaVA backbone, we observe a 35% relative reduction in over-generation errors at 82% coverage, with ablations showing diminishing returns beyond 4 views and threshold selection via cross-validation on a held-out set to achieve target risk levels. However, to improve accessibility and address the referee's concern, we will expand the abstract with key quantitative results, add a new table for threshold sensitivity, and include pseudocode for the threshold selection procedure in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper formulates generative OCR as a selective accept/abstain problem and introduces a model-agnostic Geometric Risk Controller that applies predefined cross-view consensus and stability criteria to multiple structured views of a single input image. This controller operates externally to the frozen VLM backbone, with acceptance decisions driven by lightweight structural screening rather than any internal model parameters or fitted quantities. Experiments report risk reductions on standard external OCR benchmarks at explicit coverage trade-offs. No load-bearing step reduces by construction to a self-citation, a renamed fit, or an ansatz smuggled from prior author work; the central claim remains an independent system-level addition whose performance is evaluated against external data rather than tautologically derived from its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on a new controller with tunable consensus criteria and the assumption that structured views yield independent verification signals.

free parameters (1)

consensus and stability thresholds
Predefined criteria for acceptance are likely tunable parameters chosen to balance risk and coverage.

axioms (1)

domain assumption Multiple structured views of the same input yield sufficiently independent signals for detecting visual inconsistencies
Central to the controller's ability to filter unsupported outputs.

invented entities (1)

Geometric Risk Controller no independent evidence
purpose: To enforce selective acceptance based on cross-view consensus for risk-controlled generative OCR
New system-level component introduced to address the identified misalignment.

pith-pipeline@v0.9.0 · 5485 in / 1227 out tokens · 42892 ms · 2026-05-15T08:49:33.961081+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 6 internal anchors

[1]

Rothblum

Noga Amit, Shafi Goldwasser, Orr Paradise, and Guy N. Rothblum. 2025. A Theory for Worst-Case vs. Average-Case Guarantees for LLMs. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems . https:// openreview.net/forum?id=M83RPhdsX4

work page 2025
[2]

Anastasios Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster

work page
[3]

In International Conference on Learning Represen- tations, B

Conformal Risk Control. In International Conference on Learning Represen- tations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 55198–55218. https://proceedings.iclr.cc/paper_files/paper/2024/file/ f3549ef9b5ff520a7e41ff3cc306ab2b-Paper-Conference.pdf

work page 2024
[4]

Anastasios Nikolas Angelopoulos and Stephen Bates. 2021. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.ArXiv abs/2107.07511 (2021). https://api.semanticscholar.org/CorpusID:235899036

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou, and Weipeng Chen. 2025. Ocean-OCR: Towards General OCR Application via a Vision-Language Model. arXiv:2501.15558 [cs.CV] https://arxiv.org/abs/ 2501.15558

work page arXiv 2025
[6]

Shih-Han Chou, Shivam Chandhok, Jim Little, and Leonid Sigal. 2025. MM- R3: On (In-)Consistency of Vision-Language Models (VLMs). In Findings of the Association for Computational Linguistics: ACL 2025 , Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Asso- ciation for Computational Linguistics, Vienna, Austria, 4762–478...

work page doi:10.18653/v1/2025.findings-acl.246 2025
[7]

Little, and Leonid Sigal

Shih-Han Chou, Shivam Chandhok, James J. Little, and Leonid Sigal. 2025. Test- Time Consistency in Vision Language Models. ArXiv abs/2506.22395 (2025). https://api.semanticscholar.org/CorpusID:280010669

work page arXiv 2025
[8]

C. Chow. 1970. On optimum recognition error and reject tradeoff. IEEE Transac- tions on Information Theory 16, 1 (1970), 41–46. https://doi.org/10.1109/TIT.1970. 1054406

work page doi:10.1109/tit.1970 1970
[9]

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guob- ing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guo- hong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, and Jie Tang. 2026. GLM-OCR Technical Report. arXiv:2603.10910 [cs.CL] https://arxiv.org/...

work page arXiv 2026
[10]

Gemma Team et al. 2025. Gemma 3 Technical Report. arXiv:2503.19786 [cs.CL] https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Ling- hao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chun- hui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. 2025. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Mod...

work page arXiv 2025
[12]

Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/ 2017/file/4a8423d5e91fda00bb7...

work page 2017
[13]

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self- Correct Reasoning Yet. arXiv:2310.01798 [cs.CL] https://arxiv.org/abs/2310.01798

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. 2024. OPERA: Alleviating Hal- lucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13418–13427. https://doi.org/10.1...

work page doi:10.1109/cvpr52733 2024
[15]

Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). 1516–1520. https://doi.org/10.1109/ICDAR.2019.00244

work page doi:10.1109/icdar.2019.00244 2019
[16]

Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ra- maseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. 2015. ICDAR 2015 competition on Robust Reading. In 2015 13th Inter- national Conference on Document Analysis and Recogniti...

work page doi:10.1109/icdar.2015.7333942 2015
[17]

Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Al- mazàn Almazàn, and Lluís Pere de las Heras. 2013. ICDAR 2013 Robust Reading Competition. In 2013 12th International Conference on Document Analysis and Recognition. 1484–1493. https://doi.org/10.1109/I...

work page doi:10.1109/icdar.2013.221 2013
[18]

Papadopoulos

Mehmet Onurcan Kaya, Desmond Elliott, and Dim P. Papadopoulos

work page
[19]

arXiv:2510.03574 [cs.LG] https://arxiv.org/abs/2510.03574

Efficient Test-Time Scaling for Small Vision-Language Models. arXiv:2510.03574 [cs.LG] https://arxiv.org/abs/2510.03574

work page arXiv
[20]

Zaid Khan and Yun Fu. 2024. Consistency and Uncertainty: Identifying Unreli- able Responses From Black-Box Vision-Language Models for Selective Visual Question Answering. arXiv:2404.10193 [cs.CV] https://arxiv.org/abs/2404.10193

work page arXiv 2024
[21]

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free Document Understanding Transformer. arXiv:2111.15664 [cs.LG] https://arxiv.org/abs/2111.15664

work page arXiv 2022
[22]

Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, and Jae-Gil Lee. 2025. RA-TTA: Retrieval-Augmented Test-Time Adap- tation for Vision-Language Models. In International Conference on Learn- ing Representations , Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025. 100590–100614. https://proceedings.iclr.cc/paper_files/paper/20...

work page 2025
[23]

Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, and Tong Sun

work page
[24]

arXiv:2504.04974 [cs.CV] https://arxiv.org/abs/2504.04974

Towards Visual Text Grounding of Multimodal Large Language Model. arXiv:2504.04974 [cs.CV] https://arxiv.org/abs/2504.04974

work page arXiv
[25]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 292–305. https://doi...

work page doi:10.18653/v1/2023.emnlp-main 2023
[26]

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu- Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. 2024. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67, 12 (Dec. 2024). https://doi.org/10.1007/s11432-024-4235-6

work page doi:10.1007/s11432-024-4235-6 2024
[27]

Tengchao Lv, Yupan Huang, Jingye Chen, Yuzhong Zhao, Yilin Jia, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, and Furu Wei. 2024. KOSMOS-2.5: A Multimodal Literate Model. arXiv:2309.11419 [cs.CL] https://arxiv.org/abs/ 2309.11419

work page arXiv 2024
[28]

Yuzhen Mao, Thibaut Durand, Nazanin Mehrasa, Jiawei He, and Martin Ester

work page
[29]

In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025

Calibrating LLMs for Selective Prediction: Balancing Coverage and Risk. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 . https: //openreview.net/forum?id=ZVZGjtP5VB

work page 2025
[30]

Mishra, K

A. Mishra, K. Alahari, and C. V. Jawahar. 2012. Scene Text Recognition using Higher Order Language Priors. In BMVC

work page 2012
[31]

I Can’t Believe It’s Not Better: Failure Modes in the Age of Foun- dation Models

Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. 2023. Self-Evaluation Improves Selective Generation in Large Language Models. In Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foun- dation Models" at NeurIPS 2023 Workshops (Proceedings of Machine Learning Research, Vol. 239), Javier Antorán, Arno Blaas, K...

work page 2023
[32]

William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritamb- hara Singh, Carsten Eickhoff, and Kyle Mahowald. 2026. Mechanisms of Prompt- Induced Hallucination in Vision-Language Models. arXiv:2601.05201 [cs.CV] https://arxiv.org/abs/2601.05201

work page internal anchor Pith review Pith/arXiv arXiv 2026
[33]

Jenny Schmalfuss, Nadine Chang, VS Vibashan, Maying Shen, Andrés Bruhn, and José M. Álvarez. 2025. PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), 25081–25091. https: //api.semanticscholar.org/CorpusID:279447417

work page 2025
[34]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM Test- Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314 [cs.LG] https://arxiv.org/abs/2408.03314

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Selective Prediction

Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, and Khyathi Raghavi Chandu. 2024. Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning. arXiv:2402.15610 [cs.CL] https://arxiv.org/abs/2402.15610

work page arXiv 2024
[36]

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann Lecun, and Saining Xie. 2024. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition) . IEEE Co...

work page doi:10.1109/cvpr52733.2024.00914 2024
[37]

Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-end scene text recognition. In 2011 International Conference on Computer Vision . 1457–1464. https://doi.org/10.1109/ICCV.2011.6126402

work page doi:10.1109/iccv.2011.6126402 2011
[38]

Qingni Wang, Yue Fan, and Xin Eric Wang. 2025. SAFER: Risk-Constrained Sample-then-Filter in Large Language Models. ArXiv abs/2510.10193 (2025). https://api.semanticscholar.org/CorpusID:282058131 From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

work page arXiv 2025
[39]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Im- proves Chain of Thought Reasoning in Language Models. In The Eleventh Inter- national Conference on Learning Representations . https://openreview.net/forum? id=1PL1NIMMrw

work page 2023
[40]

Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2025. Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics 13 (2025), 529–556. https://doi.org/10.1162/tacl_a_00754

work page doi:10.1162/tacl_a_00754 2025
[41]

Efficient Guided Generation for Large Language Models

Brandon T. Willard and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. arXiv:2307.09702 [cs.CL] https://arxiv.org/abs/2307.09702

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, and Changick Kim

work page
[43]

In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)

Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 1927–1951. https://doi.org/10.18653/v1/...

work page doi:10.18653/v1/2025.findings-acl.99 2025
[44]

Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, and Wei Zeng

work page
[45]

arXiv:2404.09204 [cs.CV] https://arxiv.org/abs/2404

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models. arXiv:2404.09204 [cs.CV] https://arxiv.org/abs/2404. 09204

work page arXiv
[46]

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-shot Performance of Language Models. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 12697– 12706. https://proceedings.mlr.press/v1...

work page 2021
[47]

Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang

work page
[48]

arXiv:2401.02330 [cs.CV] https://arxiv.org/abs/2401.02330

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv:2401.02330 [cs.CV] https://arxiv.org/abs/2401.02330

work page arXiv
[49]

Geigh Zollicoffer, Minh Vu, and Manish Bhattarai. 2025. MTRE: Multi-Token Reli- ability Estimation for Hallucination Detection in VLMs. arXiv:2505.11741 [cs.AI] https://arxiv.org/abs/2505.11741

work page arXiv 2025