HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment
Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3
The pith
HG-Bench reveals that no zero-shot vision-language model exceeds 55.22 percent on locating full answers or 48.22 percent on locating their reasoning steps in multi-page handwritten homework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HG-Bench supplies 500 human-annotated K-12 homework samples with question-level and step-level boxes linked by hierarchical containment. Its evaluation protocol separately measures complete-answer localization (FA) and step-level decomposition (FSm). No zero-shot system exceeds 55.22 percent on FA or 48.22 percent on FSm, while a GLM-4.6V 9B model fine-tuned on approximately ten thousand in-domain examples reaches 74.97 percent on FA and 72.26 percent on FSm. The gap suggests that off-the-shelf systems do not yet capture the spatial structure of student reasoning on the page.
What carries the argument
HG-Bench benchmark of multi-page homework images annotated with hierarchically contained question-level and step-level bounding boxes, paired with the FA and FSm page-aware evaluation metrics.
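The hierarchical containment constraint can be illustrated with a small check. The sketch below is not the paper's implementation: the `Box` layout (page index plus pixel corners), the `tol` slack, and both function names are assumptions made for illustration. It flags any step-level box that does not sit inside a question-level box on the same page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical box record: page index plus top-left / bottom-right corners."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float


def contains(outer: Box, inner: Box, tol: float = 0.0) -> bool:
    """True if `inner` is on the same page as `outer` and lies inside it, within `tol`."""
    return (
        outer.page == inner.page
        and inner.x1 >= outer.x1 - tol
        and inner.y1 >= outer.y1 - tol
        and inner.x2 <= outer.x2 + tol
        and inner.y2 <= outer.y2 + tol
    )


def containment_violations(
    question_boxes: List[Box], step_boxes: List[Box], tol: float = 0.0
) -> List[Box]:
    """Return the step boxes that are not contained in any question-level box."""
    return [
        step
        for step in step_boxes
        if not any(contains(q, step, tol) for q in question_boxes)
    ]


# One answer spanning two pages, with three candidate step boxes.
question = [Box(0, 100, 200, 500, 600), Box(1, 50, 50, 400, 300)]
steps = [
    Box(0, 120, 220, 480, 300),  # inside the page-0 answer region
    Box(1, 60, 60, 390, 120),    # inside the page-1 answer region
    Box(1, 60, 250, 390, 320),   # spills below the page-1 answer region
]
bad = containment_violations(question, steps)
```

Here `bad` holds only the third step box, whose lower edge extends past the page-1 answer region. Treating an answer as a list of per-page boxes is one way to handle answers that continue across pages; the paper may represent them differently.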
If this is right
- Automated homework assessment requires explicit spatial modeling of step-level structure rather than text recognition alone.
- Fine-tuning on in-domain handwritten homework data produces large gains on both full-answer and step-level localization.
- The hierarchical containment constraint supplies a concrete way to evaluate whether models recover the ordered structure of student reasoning.
- Multi-page, page-aware protocols expose limitations that single-image or text-only evaluations miss.
Where Pith is reading between the lines
- The identified gap suggests current vision-language models may require architectural additions for hierarchical spatial reasoning before they can support reliable automated grading pipelines.
- HG-Bench could be extended to other subjects or languages to test whether the step-level grounding shortfall generalizes beyond the K-12 homework domain studied here.
- Models that close the gap on this benchmark may also improve performance on related document tasks that involve locating ordered reasoning within noisy multi-page inputs.
Load-bearing premise
The 500 human-annotated samples drawn from the 1,489,278-image pool, together with the hierarchical containment constraint, provide a representative and correctly labeled test of page-aware answer-region grounding.
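One way a representativeness claim of this kind could be made checkable is proportional stratified sampling with a fixed seed. The sketch below is a hypothetical illustration, not a description of how HG-Bench was curated: the `grade` field, the stratum labels, and the function name `stratified_sample` are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(
    items: Sequence[dict], stratum_key: str, n: int, seed: int = 0
) -> List[dict]:
    """Draw `n` items so that each stratum keeps roughly its share of the pool."""
    if n > len(items):
        raise ValueError("cannot sample more items than the pool contains")
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced

    strata: Dict[object, List[dict]] = defaultdict(list)
    for item in items:
        strata[item[stratum_key]].append(item)

    total = len(items)
    # Floor of the proportional quota for every stratum.
    quotas = {k: (n * len(v)) // total for k, v in strata.items()}
    # Give the remaining slots to the strata with the largest remainders.
    leftover = n - sum(quotas.values())
    by_remainder = sorted(
        strata, key=lambda k: (n * len(strata[k])) % total, reverse=True
    )
    for k in by_remainder[:leftover]:
        quotas[k] += 1

    sample: List[dict] = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quotas[k]))
    return sample


# Toy pool with an assumed "grade" field; real strata might combine
# grade, subject, and page count.
pool = (
    [{"id": i, "grade": "primary"} for i in range(60)]
    + [{"id": i, "grade": "middle"} for i in range(60, 90)]
    + [{"id": i, "grade": "high"} for i in range(90, 100)]
)
picked = stratified_sample(pool, "grade", n=10, seed=42)
counts = {g: sum(1 for p in picked if p["grade"] == g) for g in ("primary", "middle", "high")}
```

With this toy pool the quotas come out as 6, 3, and 1, matching the 60/30/10 split. Publishing the seed and the stratum definitions alongside the benchmark would let others audit the draw.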
What would settle it
An independent re-annotation of the 500 samples showing low inter-annotator agreement on step-level boxes, or a zero-shot model achieving above 70 percent on both FA and FSm on the published test set.
Original abstract
Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HG-Bench, a benchmark of 500 human-annotated multi-page K-12 homework samples curated from a 1,489,278-image pool. It defines a two-level grounding task (question-level answer regions and ordered step-level subregions) under a hierarchical containment constraint, paired with a page-aware evaluation protocol that reports separate scores for complete-answer localization (FA) and step-level decomposition (FSm). Experiments show no zero-shot frontier VLM exceeds 55.22% FA or 48.22% FSm, while a GLM-4.6V 9B model fine-tuned on ~10k in-domain examples reaches 74.97/72.26, identifying step-level handwritten grounding as a capability gap.
Significance. If the annotations prove representative and correctly labeled, the benchmark supplies a reproducible, page-aware testbed that isolates spatial grounding of reasoning steps from mere text parsing. The inclusion of a trained reference model and the explicit FA/FSm split are concrete strengths that could accelerate progress on automated homework assessment.
major comments (1)
- [Abstract and HG-Bench construction section] Abstract and the section describing HG-Bench construction: the 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in HG-Bench construction. The single major comment is addressed below; we agree that additional documentation is required to support the benchmark's claims and will incorporate it in revision.
Point-by-point responses
Referee: [Abstract and HG-Bench construction section] Abstract and the section describing HG-Bench construction: the 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.
Authors: We agree that the current manuscript lacks sufficient detail on the curation process. In the revised version we will add a new subsection (likely 3.2 or 3.3) that explicitly describes: (1) the sampling frame and stratification criteria used to select the 500 samples from the 1,489,278-image pool (including grade/subject balance and page-count distribution); (2) the full annotation guidelines provided to annotators; and (3) quality metrics, including inter-annotator agreement on both box boundaries (e.g., IoU thresholds) and the hierarchical containment constraint. These additions will directly address the load-bearing concern for the reported performance gaps. revision: yes
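The box-boundary agreement the authors promise could be reported along the following lines. This is a minimal sketch under stated assumptions: the `(page, x1, y1, x2, y2)` tuple layout, the index-aligned pairing of the two annotators' boxes, and the 0.5 IoU threshold are illustrative choices, not values taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box layout: (page, x1, y1, x2, y2) in pixels.
Box = Tuple[int, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Page-aware intersection over union: boxes on different pages score 0."""
    if a[0] != b[0]:
        return 0.0
    ix1, iy1 = max(a[1], b[1]), max(a[2], b[2])
    ix2, iy2 = min(a[3], b[3]), min(a[4], b[4])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[3] - a[1]) * (a[4] - a[2])
    area_b = (b[3] - b[1]) * (b[4] - b[2])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreement_rate(ann_a: List[Box], ann_b: List[Box], threshold: float = 0.5) -> float:
    """Fraction of paired boxes whose IoU meets the threshold.

    Assumes both annotators labeled the same regions in the same order.
    """
    if len(ann_a) != len(ann_b):
        raise ValueError("annotations must be index-aligned and equal in length")
    if not ann_a:
        raise ValueError("no boxes to compare")
    hits = sum(1 for a, b in zip(ann_a, ann_b) if iou(a, b) >= threshold)
    return hits / len(ann_a)


annotator_1 = [(0, 0, 0, 10, 10), (0, 0, 0, 10, 10), (0, 0, 0, 10, 10)]
annotator_2 = [(0, 0, 0, 10, 10), (0, 0, 0, 10, 5), (1, 0, 0, 10, 10)]
rate = agreement_rate(annotator_1, annotator_2, threshold=0.5)
```

In this toy case the first pair matches exactly, the second pair has IoU 0.5, and the third pair disagrees on the page, so two of three pairs agree. Reporting the rate at several thresholds, separately for question-level and step-level boxes, would show whether step boundaries are harder to agree on.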
Circularity Check
No circularity: benchmark curation and external model evaluation contain no self-referential derivations or fitted predictions.
full rationale
The paper introduces HG-Bench by selecting 500 samples from a 1.49M pool and applying human annotation under a hierarchical containment rule, then reports direct zero-shot and fine-tuned performance numbers from external models. No equations, parameter fitting, or predictions appear; the central claim (performance gap) is an empirical measurement on the constructed test set rather than a quantity derived from or equivalent to the curation inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to support any derivation. This is a standard benchmark paper whose claims rest on the external validity of the annotations, not on internal reduction.