pith. machine review for the scientific record. sign in

arxiv: 2603.19790 · v3 · submitted 2026-03-20 · 💻 cs.CV

Recognition: no theorem link

From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-15 08:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords generative OCRvision-language modelsrisk controlgeometric verificationselective acceptancefrozen VLMsOCR deploymenterror reduction
0
0 comments X

The pith

Generative OCR with frozen vision-language models requires explicit risk control to prioritize visual verifiability over semantic plausibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models can transcribe text from images but their autoregressive decoding favors outputs that seem semantically likely rather than those strictly supported by the visual geometry. This produces rare yet severe failures such as over-generation and substitutions that lack pixel-level backing, even when average accuracy metrics appear strong. The paper recasts the problem as a selective accept-or-abstain decision and solves it with a model-agnostic Geometric Risk Controller. The controller generates multiple structured views of the input, applies lightweight structural checks, and accepts a candidate transcription only when cross-view consensus and stability meet preset thresholds. Experiments across standard OCR benchmarks and frozen backbones confirm lower rates of extreme errors and catastrophic over-generation, at the predictable expense of reduced coverage.

Core claim

Treating frozen-VLM OCR as a selective accept/abstain problem solved by a Geometric Risk Controller that probes multiple structured views, applies structural screening, and accepts transcriptions only on cross-view consensus and stability yields consistent reductions in extreme-error risk and over-generation at controlled coverage costs.

What carries the argument

The Geometric Risk Controller, which probes multiple structured views of the input image, performs lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria.

Load-bearing premise

Multiple structured views of the same input supply sufficiently independent signals to detect and filter transcriptions that lack visual support.

What would settle it

A benchmark run in which the controller accepts many transcriptions that pixel-level geometric verification or human inspection later shows to contain unsupported text or over-generation.

Figures

Figures reproduced from arXiv: 2603.19790 by Lianyong Qi, Shi Jin, Weibei Fan, Weile Gong, Xin He, Yiping Zuo, Zijian Lu.

Figure 1
Figure 1. Figure 1: Layered system view of the proposed geometric risk controller (GRC). The pipeline consists of an input layer, a [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Representative accept/abstain evidence patterns [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative OCR cases under the fixed deployment protocol. Each panel shows the crop, ground truth, the always-accept [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Risk–coverage trajectories under the common fixed protocol for all three backbones on IIIT5K and ICDAR13. Each [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Component effects and comparison to an external [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that generative OCR using frozen vision-language models suffers from a core misalignment: autoregressive decoding prioritizes semantic plausibility over visual grounding and geometric verifiability, producing severe errors such as over-generation and unsupported substitutions. To address this, the authors formulate OCR as a selective accept/abstain problem and introduce a model-agnostic Geometric Risk Controller. The controller generates multiple structured views of the input image, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability meet predefined criteria, producing a family of operating points with reduced extreme-error risk at predictable coverage costs. Experiments on standard OCR benchmarks with frozen VLM backbones are reported to show consistent risk reductions.

Significance. If the risk reductions hold, the work offers a practical system-level safeguard for deploying frozen VLMs in OCR tasks where verifiability matters more than unconstrained generation. The model-agnostic framing and emphasis on explicit risk control rather than base-model retraining could influence deployment practices in document analysis and related applications, provided the controller generalizes beyond the tested benchmarks.

major comments (2)
  1. [Proposed Method (Geometric Risk Controller)] The central claim depends on the assumption that multiple structured views of the same image yield sufficiently independent signals for the consensus-and-stability screening to detect visually unsupported transcriptions. Because all views derive from a single input, shared visual ambiguities, rendering artifacts, or low-contrast regions could produce correlated failures and spurious consensus. No derivation, robustness analysis, or empirical check against such correlations appears in the described method or experiments.
  2. [Experiments] The abstract states that experiments show 'consistent reductions in extreme-error risk' on standard benchmarks, yet no specific metrics (e.g., error-rate deltas, coverage percentages), ablation results on view count or threshold sensitivity, or details on how the consensus/stability thresholds are set are provided. This absence leaves the magnitude of the claimed benefit and the reproducibility of the operating points difficult to assess.
minor comments (1)
  1. [Abstract] The abstract introduces the 'Geometric Risk Controller' without a concise inline definition or high-level diagram reference; adding one sentence clarifying its inputs and decision rule would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [Proposed Method (Geometric Risk Controller)] The central claim depends on the assumption that multiple structured views of the same image yield sufficiently independent signals for the consensus-and-stability screening to detect visually unsupported transcriptions. Because all views derive from a single input, shared visual ambiguities, rendering artifacts, or low-contrast regions could produce correlated failures and spurious consensus. No derivation, robustness analysis, or empirical check against such correlations appears in the described method or experiments.

    Authors: We agree that potential correlations between views represent an important consideration. The structured views are generated through diverse geometric transformations (e.g., affine warps, multi-scale crops, and contrast adjustments) specifically chosen to disrupt local visual correlations while maintaining the underlying text geometry. The stability metric further mitigates spurious consensus by requiring low variance in transcription outputs across views. Although a formal derivation of statistical independence is not included, we provide empirical evidence through failure case analysis in the supplementary material showing that the controller rejects cases with high inter-view disagreement. In the revised manuscript, we will add a dedicated subsection with correlation analysis (e.g., pairwise agreement rates) and additional experiments on challenging low-contrast images. revision: yes

  2. Referee: [Experiments] The abstract states that experiments show 'consistent reductions in extreme-error risk' on standard benchmarks, yet no specific metrics (e.g., error-rate deltas, coverage percentages), ablation results on view count or threshold sensitivity, or details on how the consensus/stability thresholds are set are provided. This absence leaves the magnitude of the claimed benefit and the reproducibility of the operating points difficult to assess.

    Authors: The full paper includes these details in Section 4 and the associated tables: for example, on the IAM dataset with a frozen LLaVA backbone, we observe a 35% relative reduction in over-generation errors at 82% coverage, with ablations showing diminishing returns beyond 4 views and threshold selection via cross-validation on a held-out set to achieve target risk levels. However, to improve accessibility and address the referee's concern, we will expand the abstract with key quantitative results, add a new table for threshold sensitivity, and include pseudocode for the threshold selection procedure in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper formulates generative OCR as a selective accept/abstain problem and introduces a model-agnostic Geometric Risk Controller that applies predefined cross-view consensus and stability criteria to multiple structured views of a single input image. This controller operates externally to the frozen VLM backbone, with acceptance decisions driven by lightweight structural screening rather than any internal model parameters or fitted quantities. Experiments report risk reductions on standard external OCR benchmarks at explicit coverage trade-offs. No load-bearing step reduces by construction to a self-citation, a renamed fit, or an ansatz smuggled from prior author work; the central claim remains an independent system-level addition whose performance is evaluated against external data rather than tautologically derived from its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on a new controller with tunable consensus criteria and the assumption that structured views yield independent verification signals.

free parameters (1)
  • consensus and stability thresholds
    Predefined criteria for acceptance are likely tunable parameters chosen to balance risk and coverage.
axioms (1)
  • domain assumption Multiple structured views of the same input yield sufficiently independent signals for detecting visual inconsistencies
    Central to the controller's ability to filter unsupported outputs.
invented entities (1)
  • Geometric Risk Controller no independent evidence
    purpose: To enforce selective acceptance based on cross-view consensus for risk-controlled generative OCR
    New system-level component introduced to address the identified misalignment.

pith-pipeline@v0.9.0 · 5485 in / 1227 out tokens · 42892 ms · 2026-05-15T08:49:33.961081+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 6 internal anchors

  1. [1]

    Rothblum

    Noga Amit, Shafi Goldwasser, Orr Paradise, and Guy N. Rothblum. 2025. A Theory for Worst-Case vs. Average-Case Guarantees for LLMs. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems . https:// openreview.net/forum?id=M83RPhdsX4

  2. [2]

    Anastasios Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster

  3. [3]

    In International Conference on Learning Represen- tations, B

    Conformal Risk Control. In International Conference on Learning Represen- tations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 55198–55218. https://proceedings.iclr.cc/paper_files/paper/2024/file/ f3549ef9b5ff520a7e41ff3cc306ab2b-Paper-Conference.pdf

  4. [4]

    Anastasios Nikolas Angelopoulos and Stephen Bates. 2021. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.ArXiv abs/2107.07511 (2021). https://api.semanticscholar.org/CorpusID:235899036

  5. [5]

    Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou, and Weipeng Chen. 2025. Ocean-OCR: Towards General OCR Application via a Vision-Language Model. arXiv:2501.15558 [cs.CV] https://arxiv.org/abs/ 2501.15558

  6. [6]

    Shih-Han Chou, Shivam Chandhok, Jim Little, and Leonid Sigal. 2025. MM- R3: On (In-)Consistency of Vision-Language Models (VLMs). In Findings of the Association for Computational Linguistics: ACL 2025 , Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Asso- ciation for Computational Linguistics, Vienna, Austria, 4762–478...

  7. [7]

    Little, and Leonid Sigal

    Shih-Han Chou, Shivam Chandhok, James J. Little, and Leonid Sigal. 2025. Test- Time Consistency in Vision Language Models. ArXiv abs/2506.22395 (2025). https://api.semanticscholar.org/CorpusID:280010669

  8. [8]

    C. Chow. 1970. On optimum recognition error and reject tradeoff. IEEE Transac- tions on Information Theory 16, 1 (1970), 41–46. https://doi.org/10.1109/TIT.1970. 1054406

  9. [9]

    Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guob- ing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guo- hong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, and Jie Tang. 2026. GLM-OCR Technical Report. arXiv:2603.10910 [cs.CL] https://arxiv.org/...

  10. [10]

    Gemma Team et al. 2025. Gemma 3 Technical Report. arXiv:2503.19786 [cs.CL] https://arxiv.org/abs/2503.19786

  11. [11]

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Ling- hao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chun- hui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. 2025. OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Mod...

  12. [12]

    Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/ 2017/file/4a8423d5e91fda00bb7...

  13. [13]

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self- Correct Reasoning Yet. arXiv:2310.01798 [cs.CL] https://arxiv.org/abs/2310.01798

  14. [14]

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. 2024. OPERA: Alleviating Hal- lucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13418–13427. https://doi.org/10.1...

  15. [15]

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). 1516–1520. https://doi.org/10.1109/ICDAR.2019.00244

  16. [16]

    Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ra- maseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. 2015. ICDAR 2015 competition on Robust Reading. In 2015 13th Inter- national Conference on Document Analysis and Recogniti...

  17. [17]

    Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Al- mazàn Almazàn, and Lluís Pere de las Heras. 2013. ICDAR 2013 Robust Reading Competition. In 2013 12th International Conference on Document Analysis and Recognition. 1484–1493. https://doi.org/10.1109/I...

  18. [18]

    Papadopoulos

    Mehmet Onurcan Kaya, Desmond Elliott, and Dim P. Papadopoulos

  19. [19]

    arXiv:2510.03574 [cs.LG] https://arxiv.org/abs/2510.03574

    Efficient Test-Time Scaling for Small Vision-Language Models. arXiv:2510.03574 [cs.LG] https://arxiv.org/abs/2510.03574

  20. [20]

    Zaid Khan and Yun Fu. 2024. Consistency and Uncertainty: Identifying Unreli- able Responses From Black-Box Vision-Language Models for Selective Visual Question Answering. arXiv:2404.10193 [cs.CV] https://arxiv.org/abs/2404.10193

  21. [21]

    Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free Document Understanding Transformer. arXiv:2111.15664 [cs.LG] https://arxiv.org/abs/2111.15664

  22. [22]

    Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, and Jae-Gil Lee. 2025. RA-TTA: Retrieval-Augmented Test-Time Adap- tation for Vision-Language Models. In International Conference on Learn- ing Representations , Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025. 100590–100614. https://proceedings.iclr.cc/paper_files/paper/20...

  23. [23]

    Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, and Tong Sun

  24. [24]

    arXiv:2504.04974 [cs.CV] https://arxiv.org/abs/2504.04974

    Towards Visual Text Grounding of Multimodal Large Language Model. arXiv:2504.04974 [cs.CV] https://arxiv.org/abs/2504.04974

  25. [25]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 292–305. https://doi...

  26. [26]

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu- Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. 2024. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67, 12 (Dec. 2024). https://doi.org/10.1007/s11432-024-4235-6

  27. [27]

    Tengchao Lv, Yupan Huang, Jingye Chen, Yuzhong Zhao, Yilin Jia, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, and Furu Wei. 2024. KOSMOS-2.5: A Multimodal Literate Model. arXiv:2309.11419 [cs.CL] https://arxiv.org/abs/ 2309.11419

  28. [28]

    Yuzhen Mao, Thibaut Durand, Nazanin Mehrasa, Jiawei He, and Martin Ester

  29. [29]

    In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025

    Calibrating LLMs for Selective Prediction: Balancing Coverage and Risk. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 . https: //openreview.net/forum?id=ZVZGjtP5VB

  30. [30]

    Mishra, K

    A. Mishra, K. Alahari, and C. V. Jawahar. 2012. Scene Text Recognition using Higher Order Language Priors. In BMVC

  31. [31]

    I Can’t Believe It’s Not Better: Failure Modes in the Age of Foun- dation Models

    Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. 2023. Self-Evaluation Improves Selective Generation in Large Language Models. In Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foun- dation Models" at NeurIPS 2023 Workshops (Proceedings of Machine Learning Research, Vol. 239), Javier Antorán, Arno Blaas, K...

  32. [32]

    William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritamb- hara Singh, Carsten Eickhoff, and Kyle Mahowald. 2026. Mechanisms of Prompt- Induced Hallucination in Vision-Language Models. arXiv:2601.05201 [cs.CV] https://arxiv.org/abs/2601.05201

  33. [33]

    Jenny Schmalfuss, Nadine Chang, VS Vibashan, Maying Shen, Andrés Bruhn, and José M. Álvarez. 2025. PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), 25081–25091. https: //api.semanticscholar.org/CorpusID:279447417

  34. [34]

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM Test- Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314 [cs.LG] https://arxiv.org/abs/2408.03314

  35. [35]

    Selective Prediction

    Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, and Khyathi Raghavi Chandu. 2024. Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning. arXiv:2402.15610 [cs.CL] https://arxiv.org/abs/2402.15610

  36. [36]

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann Lecun, and Saining Xie. 2024. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition) . IEEE Co...

  37. [37]

    Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-end scene text recognition. In 2011 International Conference on Computer Vision . 1457–1464. https://doi.org/10.1109/ICCV.2011.6126402

  38. [38]

    Qingni Wang, Yue Fan, and Xin Eric Wang. 2025. SAFER: Risk-Constrained Sample-then-Filter in Large Language Models. ArXiv abs/2510.10193 (2025). https://api.semanticscholar.org/CorpusID:282058131 From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

  39. [39]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Im- proves Chain of Thought Reasoning in Language Models. In The Eleventh Inter- national Conference on Learning Representations . https://openreview.net/forum? id=1PL1NIMMrw

  40. [40]

    Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2025. Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics 13 (2025), 529–556. https://doi.org/10.1162/tacl_a_00754

  41. [41]

    Efficient Guided Generation for Large Language Models

    Brandon T. Willard and Rémi Louf. 2023. Efficient Guided Generation for Large Language Models. arXiv:2307.09702 [cs.CL] https://arxiv.org/abs/2307.09702

  42. [42]

    Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, and Changick Kim

  43. [43]

    In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)

    Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 1927–1951. https://doi.org/10.18653/v1/...

  44. [44]

    Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, and Wei Zeng

  45. [45]

    arXiv:2404.09204 [cs.CV] https://arxiv.org/abs/2404

    TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models. arXiv:2404.09204 [cs.CV] https://arxiv.org/abs/2404. 09204

  46. [46]

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-shot Performance of Language Models. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 12697– 12706. https://proceedings.mlr.press/v1...

  47. [47]

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang

  48. [48]

    arXiv:2401.02330 [cs.CV] https://arxiv.org/abs/2401.02330

    LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model. arXiv:2401.02330 [cs.CV] https://arxiv.org/abs/2401.02330

  49. [49]

    Geigh Zollicoffer, Minh Vu, and Manish Bhattarai. 2025. MTRE: Multi-Token Reli- ability Estimation for Hallucination Detection in VLMs. arXiv:2505.11741 [cs.AI] https://arxiv.org/abs/2505.11741