DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Pith reviewed 2026-05-17 04:38 UTC · model grok-4.3
The pith
DocVAL improves compact document VQA models by up to 6-7 ANLS points via validated chain-of-thought distillation from larger teachers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DocVAL transfers explicit spatial reasoning from large teacher VLMs to compact student VLMs through three components: teacher-generated spatial CoT supervision, a rule-based dual-mode validator that filters low-quality signals and supplies pixel-level corrective feedback, and a validation-driven two-stage training procedure with iterative refinement. Text detection operates only as training-time scaffolding, so the final student functions as a pure VLM without OCR or detection modules at inference. The paper reports consistent gains of up to 6-7 ANLS points across document benchmarks, strong performance under the newly introduced mean Average Precision (mAP) localization metric, and the release of 95K validator-verified CoT traces.
What carries the argument
The rule-based dual-mode validator that filters low-quality teacher signals and supplies fine-grained pixel-level corrective feedback during two-stage training.
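As a rough illustration of what "dual-mode" could mean in practice, the sketch below separates a filter mode (accept or reject a teacher trace) from a feedback mode (return per-coordinate pixel offsets). It is not the paper's implementation: the (x1, y1, x2, y2) box format, the 0.5 IoU threshold, and the names `iou`, `filter_mode`, and `feedback_mode` are all assumptions made for this example, and the detected text regions stand in for whatever the training-time text detector produces.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_mode(pred: Box, regions: List[Box], threshold: float = 0.5) -> bool:
    """Hypothetical filter rule: keep a trace only if its box overlaps
    some detected text region by at least `threshold` IoU."""
    return any(iou(pred, r) >= threshold for r in regions)


def feedback_mode(pred: Box, regions: List[Box]) -> Optional[Box]:
    """Hypothetical feedback rule: pixel offsets (dx1, dy1, dx2, dy2) that
    would move the predicted box onto the best-overlapping text region.
    Returns None when no regions were detected. If nothing overlaps, the
    first region is used, which a real rule set would likely treat separately."""
    if not regions:
        return None
    target = max(regions, key=lambda r: iou(pred, r))
    return (
        target[0] - pred[0],
        target[1] - pred[1],
        target[2] - pred[2],
        target[3] - pred[3],
    )
```

For a predicted box (10, 10, 50, 30) and a detected region (12, 10, 52, 30), this feedback would be (2, 0, 2, 0), meaning both x-coordinates should shift right by two pixels.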
If this is right
- Compact VLMs reach higher accuracy on grounded document VQA while operating without OCR or detection at inference.
- High-quality validated supervision outperforms scaling up unfiltered teacher data for training efficiency.
- The framework supports more trustworthy spatial grounding in practical document understanding deployments.
- Mean Average Precision provides a useful metric for measuring localization quality in document question answering.
- Validator-verified CoT traces can be reused to train multiple efficient student models effectively.
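The summary does not say how the paper computes mAP for document QA. The sketch below follows the common object-detection convention instead (average precision per IoU threshold, then averaged over thresholds 0.50 to 0.95), and assumes one predicted box with a confidence score per question and one ground-truth box per question. The function names and the input layout are hypothetical.

```python
from typing import List, Optional


def average_precision(ious: List[float], scores: List[float], threshold: float) -> float:
    """AP at one IoU threshold. ious[i] is the IoU between prediction i and
    its ground-truth box; scores[i] is the model's confidence (assumed)."""
    n_gt = len(ious)
    if n_gt == 0:
        return 0.0
    order = sorted(range(n_gt), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if ious[i] >= threshold:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    # Make precision monotonically non-increasing (standard interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ious: List[float], scores: List[float],
            thresholds: Optional[List[float]] = None) -> float:
    """Mean of AP over IoU thresholds 0.50, 0.55, ..., 0.95 by default."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * k for k in range(10)]
    return sum(average_precision(ious, scores, t) for t in thresholds) / len(thresholds)
```

Averaging over several thresholds rewards tight boxes more than a single IoU >= 0.5 cutoff would, which is presumably the property that makes such a metric attractive for grounding.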
Where Pith is reading between the lines
- Similar validation steps during distillation could raise performance in other vision-language reasoning tasks that require spatial or layout understanding.
- A learned validator might eventually replace the current rule-based version and reduce dependence on hand-crafted checks.
- Prioritizing supervision quality over raw data volume offers a general route to stronger small models without increasing model size.
Load-bearing premise
The rule-based dual-mode validator reliably filters low-quality teacher signals and supplies accurate pixel-level corrective feedback without introducing systematic bias or requiring ground-truth labels during validation.
What would settle it
Training the same compact VLM on identical teacher CoT traces but without the validator's filtering or corrections, and obtaining equal or higher ANLS and mAP scores, would indicate that validation adds no benefit.
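For readers comparing such runs, ANLS (Average Normalized Levenshtein Similarity) is conventionally computed as below for DocVQA-style benchmarks: each prediction scores 1 minus its normalized edit distance to the closest reference answer, and scores whose distance reaches a threshold of 0.5 are set to zero. This is a minimal sketch of that standard definition; the lowercasing and whitespace stripping are assumptions, and the paper's exact normalization is not stated here.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best per-reference similarity.
    references[i] lists every accepted answer for question i."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The function returns a value in [0, 1]. If the benchmarks report ANLS as a percentage, a gain of 6-7 points would correspond to 0.06-0.07 on this scale.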
Original abstract
Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://github.com/ahmad-shirazi/DocVAL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DocVAL, a validated chain-of-thought distillation framework to transfer spatial reasoning from large teacher VLMs to compact student VLMs for document visual question answering. It incorporates teacher-generated spatial CoT, a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrections using text detection as training scaffolding, and a validation-driven two-stage training with iterative refinement. The paper claims consistent improvements of up to 6-7 ANLS points over comparable compact VLMs across benchmarks, introduces mAP as a localization metric, and releases 95K validator-verified CoT traces along with code.
Significance. Should the results be confirmed with appropriate controls, DocVAL demonstrates that high-quality validated supervision can outperform scaling unfiltered data for efficient and trustworthy document grounding in compact models. The release of the dataset and implementation code is a strength that supports reproducibility in the field. Introducing mAP for evaluating spatial grounding in document QA is a positive addition to evaluation practices.
major comments (2)
- [§4] The reported gains of up to 6-7 ANLS points are presented without ablation studies isolating the contribution of the dual-mode validator and validated CoT from the two-stage training procedure or the teacher CoT supervision alone. This is load-bearing for the central claim that the validation mechanism is responsible for the improvements over comparable compact VLMs.
- [§3.2] The rule-based dual-mode validator is described as providing unbiased pixel-level corrective feedback without ground-truth labels, yet no error analysis, human validation of its decisions, or comparison to an oracle filter is included. This leaves the assumption that the rules reliably detect low-quality spatial signals without systematic bias untested, which is central to attributing gains to the validator rather than other training choices.
minor comments (1)
- [Abstract] The claim of 'consistent improvements' would benefit from explicit mention of the exact baselines and any statistical significance testing used to support the 6-7 ANLS gains.
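One way such significance testing is often done for per-question metrics is a paired bootstrap over the evaluation set. The sketch below is an illustration under assumptions, not something the paper or the referee specifies: it takes per-question ANLS scores for a baseline and for the proposed model on the same questions, and reports the observed mean gain together with the fraction of resamples in which the gain is not positive. The function name, the resample count, and the seed are hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap(baseline: List[float], proposed: List[float],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Return (observed mean gain, fraction of resamples with gain <= 0).
    Both lists must hold per-question scores for the same questions, in order."""
    if len(baseline) != len(proposed) or not baseline:
        raise ValueError("need two aligned, non-empty score lists")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [p - b for b, p in zip(baseline, proposed)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair intact.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if resampled_gain <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

A small fraction suggests the gain is stable under resampling of the test questions. It does not account for variance across training runs, which would need separate seeds.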
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the presentation of our results and methods.
- Referee: [§4] The reported gains of up to 6-7 ANLS points are presented without ablation studies isolating the contribution of the dual-mode validator and validated CoT from the two-stage training procedure or the teacher CoT supervision alone. This is load-bearing for the central claim that the validation mechanism is responsible for the improvements over comparable compact VLMs.
  Authors: We agree that the current manuscript does not contain ablations that isolate the dual-mode validator from the two-stage training procedure and teacher CoT supervision. The reported gains reflect the full DocVAL pipeline. In the revised version we will add a dedicated ablation subsection in §4 that reports four controlled settings on the same student backbone and data: (1) standard fine-tuning, (2) teacher CoT distillation without any validation, (3) two-stage training with unvalidated CoT, and (4) the complete validated DocVAL pipeline. These experiments will quantify the incremental contribution of the validator and thereby support the claim that validated supervision drives the observed improvements.
  Revision: yes
- Referee: [§3.2] The rule-based dual-mode validator is described as providing unbiased pixel-level corrective feedback without ground-truth labels, yet no error analysis, human validation of its decisions, or comparison to an oracle filter is included. This leaves the assumption that the rules reliably detect low-quality spatial signals without systematic bias untested, which is central to attributing gains to the validator rather than other training choices.
  Authors: We acknowledge that the manuscript currently lacks quantitative error analysis or human validation of the validator. The dual-mode validator applies deterministic rules derived from off-the-shelf text detection to filter and correct spatial CoT without using QA ground-truth labels. To address the concern, we will expand §3.2 with an error-analysis subsection that includes: (i) human agreement scores on a random sample of 500 validator decisions (accepted vs. filtered), (ii) an explicit discussion of possible rule-induced biases (e.g., layout-specific over-filtering), and (iii) a comparison against an oracle validator that has access to ground-truth answer boxes, reporting precision, recall, and F1 of the filtering step. These additions will provide direct evidence regarding the reliability and bias profile of the validator.
  Revision: yes
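The promised oracle comparison reduces to a standard binary classification report. The sketch below shows one way to compute it, assuming the positive class is "trace filtered out" and that both the validator and the oracle produce one boolean decision per trace; the function name and that choice of positive class are assumptions, since the rebuttal does not fix them.

```python
from typing import List, Tuple


def filter_prf(validator_rejects: List[bool],
               oracle_rejects: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of the validator's reject decisions,
    scored against an oracle that knows the ground-truth answer boxes."""
    if len(validator_rejects) != len(oracle_rejects):
        raise ValueError("decision lists must align")
    pairs = list(zip(validator_rejects, oracle_rejects))
    tp = sum(1 for v, o in pairs if v and o)          # correctly filtered
    fp = sum(1 for v, o in pairs if v and not o)      # good trace wrongly filtered
    fn = sum(1 for v, o in pairs if not v and o)      # bad trace let through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Low precision would indicate over-filtering (the layout-specific bias the authors mention), while low recall would mean low-quality traces are reaching the student.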
Circularity Check
No significant circularity; empirical method with no self-referential derivations or fitted predictions.
The paper describes an empirical framework involving teacher-generated CoT, a rule-based validator, and two-stage training, with performance claims based on benchmark results (ANLS and mAP). No mathematical derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. There are no fitted parameters renamed as predictions, no self-citation load-bearing uniqueness theorems, and no ansatz smuggling. The central claims rest on experimental outcomes rather than circular definitions, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Large teacher VLMs generate accurate spatial chain-of-thought reasoning for document layouts.
- Ad hoc to paper: Rule-based validation can detect and correct low-quality spatial signals at pixel level without introducing bias.
invented entities (1)
- Dual-mode validator: no independent evidence