
arxiv: 2511.22521 · v2 · submitted 2025-11-27 · 💻 cs.CV · cs.AI

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

Pith reviewed 2026-05-17 04:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords document visual question answering · chain-of-thought distillation · vision-language models · spatial grounding · validated supervision · compact models · localization metrics

The pith

DocVAL improves compact document VQA models by up to 6-7 ANLS points via validated chain-of-thought distillation from larger teachers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Document visual question answering demands both correct answers and precise localization of those answers inside complex layouts. Large vision-language models handle the spatial reasoning well but run too slowly for widespread use, while compact versions typically lose grounding accuracy under ordinary fine-tuning or distillation. DocVAL generates step-by-step spatial reasoning traces from a large teacher, then applies a rule-based validator to discard weak traces and supply pixel-level corrections before training the compact student. The student learns to output both the answer and its location using only the image and question, with text detection limited to training scaffolding. The result is that high-quality, filtered supervision produces better outcomes than simply scaling up raw teacher data.

Core claim

DocVAL transfers explicit spatial reasoning from large teacher VLMs to compact student VLMs through teacher-generated spatial CoT supervision, a rule-based dual-mode validator that filters low-quality signals and supplies pixel-level corrective feedback, and a validation-driven two-stage training procedure with iterative refinement. Text detection operates only as training-time scaffolding, so the final student functions as a pure VLM without OCR or detection modules at inference. This produces consistent gains of up to 6-7 ANLS points across document benchmarks, strong performance under the introduced mean Average Precision localization metric, and the release of 95K validator-verified CoT traces.

What carries the argument

The rule-based dual-mode validator that filters low-quality teacher signals and supplies fine-grained pixel-level corrective feedback during two-stage training.
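
To make the two modes concrete, here is a minimal Python sketch. It assumes, hypothetically, that a single rule compares the predicted answer box with boxes from a text detector by IoU. The names `validate` and `iou`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are illustrative assumptions, not the paper's rules or its actual modules.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels; assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def validate(pred: Box, detected: List[Box], mode: str, thresh: float = 0.5):
    """Hypothetical rule: a trace passes if its box overlaps a detected text box.

    mode="filter": return True (accept) or False (reject).
    mode="verify": return per-coordinate pixel offsets from the predicted box
                   to the best-matching detected box, or None if no boxes exist.
    """
    best = max(detected, key=lambda d: iou(pred, d), default=None)
    if mode == "filter":
        return best is not None and iou(pred, best) >= thresh
    if mode == "verify":
        if best is None:
            return None
        return tuple(t - p for p, t in zip(pred, best))
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the filter mode yields only a binary decision, while the verify mode returns signed offsets that could be turned into corrective feedback. A real validator would likely combine several such rules.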

If this is right

  • Compact VLMs reach higher accuracy on grounded document VQA while operating without OCR or detection at inference.
  • High-quality validated supervision outperforms scaling up unfiltered teacher data for training efficiency.
  • The framework supports more trustworthy spatial grounding in practical document understanding deployments.
  • Mean Average Precision provides a useful metric for measuring localization quality in document question answering.
  • Validator-verified CoT traces can be reused to train multiple efficient student models effectively.
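
The paper's exact mAP definition is not reproduced on this page, so the following is only a simplified stand-in: with one predicted box and one ground-truth box per question, it computes the fraction of predictions whose IoU clears a threshold, then averages over COCO-style thresholds from 0.50 to 0.95. The function name `mean_precision_over_iou` and the threshold grid are assumptions.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_precision_over_iou(
    preds: Sequence[Box],
    gts: Sequence[Box],
    thresholds: Optional[Sequence[float]] = None,
) -> float:
    """Average, over IoU thresholds, of the share of predictions that are hits."""
    if thresholds is None:
        # Assumed COCO-style grid: 0.50, 0.55, ..., 0.95
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and the same length")
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    per_threshold = [sum(v >= t for v in ious) / len(ious) for t in thresholds]
    return sum(per_threshold) / len(per_threshold)
```

A full mAP would also rank predictions by confidence and integrate a precision-recall curve. This sketch drops that step because each question contributes exactly one box.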

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar validation steps during distillation could raise performance in other vision-language reasoning tasks that require spatial or layout understanding.
  • A learned validator might eventually replace the current rule-based version and reduce dependence on hand-crafted checks.
  • Prioritizing supervision quality over raw data volume offers a general route to stronger small models without increasing model size.

Load-bearing premise

The rule-based dual-mode validator reliably filters low-quality teacher signals and supplies accurate pixel-level corrective feedback without introducing systematic bias or requiring ground-truth labels during validation.

What would settle it

Training the same compact VLM on identical teacher CoT traces but without the validator's filtering or corrections, and obtaining equal or higher ANLS and mAP scores, would indicate that validation adds no benefit.
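
For readers who want to run such a comparison, here is a minimal sketch of ANLS as it is commonly defined: per question, take the best `1 - NL` over the accepted answers, where `NL` is the length-normalized Levenshtein distance, and score 0 when `NL` reaches the threshold (0.5 by convention). The lowercasing and whitespace stripping are assumptions that vary between evaluation scripts, and this is not the authors' evaluation code.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(preds: Sequence[str], golds: Sequence[Sequence[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity in the range [0, 1]."""
    scores = []
    for pred, answers in zip(preds, golds):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Multiplying the result by 100 gives the point scale used when the page speaks of "6-7 ANLS points".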

Figures

Figures reproduced from arXiv: 2511.22521 by Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Rajiv Ramnath, Ser-Nam Lim.

Figure 1: Overview of the DocVAL Framework. A three-phase pipeline for validated chain-of-thought (CoT) distillation. Phase A: Teacher Data Generation. The teacher VLM (Gemini 2.5 Pro) generates CoT traces from raw documents, which are filtered by DocVAL for quality assurance. Phase B1: Student Fine-Tuning. A smaller VLM (Gemma3-12B) is fine-tuned on the validated CoT dataset. Phase B2: Instruction Tuning. The stude…
Figure 2: VAL Dual-Mode Architecture Comparison. VAL operates in two distinct modes sharing the same 5-module architecture but differing in output granularity. Left (Filter): Phase A processes teacher outputs (D2) at 50 examples/sec, producing binary Accept/Reject decisions to curate 95K high-quality training examples from 102K raw outputs. Right (Verifier): Phase B2 processes student outputs (D4) at 12 examples/sec…
Original abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://github.com/ahmad-shirazi/DocVAL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DocVAL, a validated chain-of-thought distillation framework to transfer spatial reasoning from large teacher VLMs to compact student VLMs for document visual question answering. It incorporates teacher-generated spatial CoT, a rule-based dual-mode validator that filters low-quality signals and provides pixel-level corrections using text detection as training scaffolding, and a validation-driven two-stage training with iterative refinement. The paper claims consistent improvements of up to 6-7 ANLS points over comparable compact VLMs across benchmarks, introduces mAP as a localization metric, and releases 95K validator-verified CoT traces along with code.

Significance. Should the results be confirmed with appropriate controls, DocVAL demonstrates that high-quality validated supervision can outperform scaling unfiltered data for efficient and trustworthy document grounding in compact models. The release of the dataset and implementation code is a strength that supports reproducibility in the field. Introducing mAP for evaluating spatial grounding in document QA is a positive addition to evaluation practices.

major comments (2)
  1. [§4] The reported gains of up to 6-7 ANLS points are presented without ablation studies isolating the contribution of the dual-mode validator and validated CoT from the two-stage training procedure or the teacher CoT supervision alone. This is load-bearing for the central claim that the validation mechanism is responsible for the improvements over comparable compact VLMs.
  2. [§3.2] The rule-based dual-mode validator is described as providing unbiased pixel-level corrective feedback without ground-truth labels, yet no error analysis, human validation of its decisions, or comparison to an oracle filter is included. This leaves the assumption that the rules reliably detect low-quality spatial signals without systematic bias untested, which is central to attributing gains to the validator rather than other training choices.
minor comments (1)
  1. [Abstract] The claim of 'consistent improvements' would benefit from explicit mention of the exact baselines and any statistical significance testing used to support the 6-7 ANLS gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: [§4] The reported gains of up to 6-7 ANLS points are presented without ablation studies isolating the contribution of the dual-mode validator and validated CoT from the two-stage training procedure or the teacher CoT supervision alone. This is load-bearing for the central claim that the validation mechanism is responsible for the improvements over comparable compact VLMs.

    Authors: We agree that the current manuscript does not contain ablations that isolate the dual-mode validator from the two-stage training procedure and teacher CoT supervision. The reported gains reflect the full DocVAL pipeline. In the revised version we will add a dedicated ablation subsection in §4 that reports four controlled settings on the same student backbone and data: (1) standard fine-tuning, (2) teacher CoT distillation without any validation, (3) two-stage training with unvalidated CoT, and (4) the complete validated DocVAL pipeline. These experiments will quantify the incremental contribution of the validator and thereby support the claim that validated supervision drives the observed improvements. revision: yes

  2. Referee: [§3.2] The rule-based dual-mode validator is described as providing unbiased pixel-level corrective feedback without ground-truth labels, yet no error analysis, human validation of its decisions, or comparison to an oracle filter is included. This leaves the assumption that the rules reliably detect low-quality spatial signals without systematic bias untested, which is central to attributing gains to the validator rather than other training choices.

    Authors: We acknowledge that the manuscript currently lacks quantitative error analysis or human validation of the validator. The dual-mode validator applies deterministic rules derived from off-the-shelf text detection to filter and correct spatial CoT without using QA ground-truth labels. To address the concern, we will expand §3.2 with an error-analysis subsection that includes: (i) human agreement scores on a random sample of 500 validator decisions (accepted vs. filtered), (ii) an explicit discussion of possible rule-induced biases (e.g., layout-specific over-filtering), and (iii) a comparison against an oracle validator that has access to ground-truth answer boxes, reporting precision, recall, and F1 of the filtering step. These additions will provide direct evidence regarding the reliability and bias profile of the validator. revision: yes
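
The oracle comparison promised in the second response reduces to standard precision, recall, and F1 over accept/reject decisions. A minimal sketch follows, assuming "accept" is treated as the positive class and that validator and oracle decisions are aligned lists of booleans. The function name `filter_quality` is hypothetical.

```python
from typing import Dict, Sequence


def filter_quality(
    validator_accepts: Sequence[bool],
    oracle_accepts: Sequence[bool],
) -> Dict[str, float]:
    """Precision, recall, and F1 of validator accepts against oracle accepts."""
    if len(validator_accepts) != len(oracle_accepts):
        raise ValueError("decision lists must have the same length")
    pairs = list(zip(validator_accepts, oracle_accepts))
    tp = sum(v and o for v, o in pairs)        # both accept
    fp = sum(v and not o for v, o in pairs)    # validator accepts, oracle rejects
    fn = sum(not v and o for v, o in pairs)    # validator rejects, oracle accepts
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Low precision here would mean that bad traces slip through the filter, while low recall would point to over-filtering of the kind the authors mention as a possible layout-specific bias.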

Circularity Check

0 steps flagged

No significant circularity; empirical method with no self-referential derivations or fitted predictions.

full rationale

The paper describes an empirical framework involving teacher-generated CoT, a rule-based validator, and two-stage training, with performance claims based on benchmark results (ANLS and mAP). No mathematical derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. There are no fitted parameters renamed as predictions, no uniqueness theorems that rest on self-citation, and no ansatz smuggling. The central claims rest on experimental outcomes measured on external benchmarks rather than on circular definitions, so the argument is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach assumes large VLMs produce high-quality spatial CoT that can be validated by rules without ground truth; the validator itself is an invented component whose reliability is not independently evidenced beyond the reported gains.

axioms (2)
  • [domain assumption] Large teacher VLMs generate accurate spatial chain-of-thought reasoning for document layouts.
    Invoked in the description of teacher-generated supervision.
  • [ad hoc to paper] Rule-based validation can detect and correct low-quality spatial signals at pixel level without introducing bias.
    Central to the dual-mode validator component.
invented entities (1)
  • Dual-mode validator [no independent evidence]
    Purpose: Filters low-quality CoT traces and supplies pixel-level corrective feedback during training.
    New component introduced in the framework; no external falsifiable evidence provided beyond the method description.

pith-pipeline@v0.9.0 · 5592 in / 1399 out tokens · 29630 ms · 2026-05-17T04:38:34.360870+00:00 · methodology

