HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Boyan Shi; Canran Xiao; Chuangxin Zhao; Jiali Chen; Ji Qi; Juanzi Li; Jun Xia; Yanling Wang; Yan Wang; Yijian Lu

arxiv: 2606.25491 · v1 · pith:Q5TO3UEWnew · submitted 2026-06-24 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3

keywords: handwritten answer grounding · multi-page document analysis · homework assessment · vision-language models · benchmark · step-level localization · hierarchical bounding boxes · automated grading

The pith

HG-Bench reveals that no zero-shot vision-language model exceeds 55.22 percent on locating full answers or 48.22 percent on locating their reasoning steps in multi-page handwritten homework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HG-Bench to test whether models can locate both complete answer regions and the ordered step-level subregions inside them across sequences of homework page images. It supplies 500 human-annotated samples curated from a pool of 1,489,278 images, with question-level and step-level boxes linked by a hierarchical containment constraint. A page-aware protocol then scores full-answer localization (FA) separately from step-level decomposition (FSm). Across frontier models, zero-shot performance is capped at 55.22 percent on FA and 48.22 percent on FSm. A reference model fine-tuned on roughly ten thousand in-domain examples reaches 74.97 and 72.26 on the two metrics, establishing step-level grounding as an open capability gap.

Core claim

HG-Bench supplies 500 human-annotated K-12 homework samples with question-level and step-level boxes linked by hierarchical containment. Its evaluation protocol separately measures complete-answer localization (FA) and step-level decomposition (FSm). No zero-shot system exceeds 55.22 percent on FA or 48.22 percent on FSm, while a GLM-4.6V 9B model fine-tuned on approximately ten thousand in-domain examples reaches 74.97/72.26, showing that current systems do not yet capture the spatial structure of student reasoning on the page.

What carries the argument

HG-Bench benchmark of multi-page homework images annotated with hierarchically contained question-level and step-level bounding boxes, paired with the FA and FSm page-aware evaluation metrics.
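The containment constraint can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Box` record, the pixel tolerance `tol`, and the rule that a step box must sit inside at least one answer box on the same page are hypothetical choices, not the paper's published annotation format or validation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box on one page (hypothetical layout, not the paper's schema)."""
    page: int   # zero-based index into the page sequence
    x1: float
    y1: float
    x2: float
    y2: float


def contains(outer: Box, inner: Box, tol: float = 0.0) -> bool:
    """True if inner is on the same page as outer and lies inside it, within tol."""
    return (
        outer.page == inner.page
        and inner.x1 >= outer.x1 - tol
        and inner.y1 >= outer.y1 - tol
        and inner.x2 <= outer.x2 + tol
        and inner.y2 <= outer.y2 + tol
    )


def violations(answer_boxes: list[Box], step_boxes: list[Box], tol: float = 0.0) -> list[int]:
    """Indices of step boxes that no answer box of the same question contains."""
    return [
        i
        for i, step in enumerate(step_boxes)
        if not any(contains(answer, step, tol) for answer in answer_boxes)
    ]


# One answer spread over two pages, with three ordered steps.
answer = [Box(0, 100, 200, 500, 600), Box(1, 50, 50, 400, 300)]
steps = [
    Box(0, 120, 220, 480, 300),
    Box(1, 60, 60, 390, 120),
    Box(1, 60, 280, 390, 340),  # spills past the bottom edge of the page-1 answer box
]
print(violations(answer, steps))
```

Keeping the page index inside the box is what makes the check page-aware: a step box with perfect coordinates on the wrong page still counts as a violation.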

If this is right

Automated homework assessment requires explicit spatial modeling of step-level structure rather than text recognition alone.
Fine-tuning on in-domain handwritten homework data produces large gains on both full-answer and step-level localization.
The hierarchical containment constraint supplies a concrete way to evaluate whether models recover the ordered structure of student reasoning.
Multi-page page-aware protocols expose limitations that single-image or text-only evaluations miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identified gap suggests current vision-language models may require architectural additions for hierarchical spatial reasoning before they can support reliable automated grading pipelines.
HG-Bench could be extended to other subjects or languages to test whether the step-level grounding shortfall generalizes beyond the K-12 homework domain studied here.
Models that close the gap on this benchmark may also improve performance on related document tasks that involve locating ordered reasoning within noisy multi-page inputs.

Load-bearing premise

The 500 human-annotated samples drawn from the 1,489,278-image pool, together with the hierarchical containment constraint, provide a representative and correctly labeled test of page-aware answer-region grounding.

What would settle it

An independent re-annotation of the 500 samples showing low inter-annotator agreement on step-level boxes, or a zero-shot model achieving above 70 percent on both FA and FSm on the published test set.

Original abstract

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.
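The abstract does not define FA or FSm, so the following is not an implementation of either metric. It is a minimal sketch of what "page-aware" scoring could mean in general: a predicted box can only match a gold box on the same page, and a match requires an IoU of at least `threshold`. The tuple layout `(page, x1, y1, x2, y2)`, the greedy matching, and the 0.5 default are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def page_aware_hit_rate(gold, pred, threshold=0.5):
    """Fraction of gold boxes matched by an unused same-page prediction.

    gold and pred are lists of (page, x1, y1, x2, y2) tuples. Each prediction
    can be consumed by at most one gold box (greedy, in gold order).
    """
    used = set()
    hits = 0
    for g in gold:
        best_j, best_score = None, threshold
        for j, p in enumerate(pred):
            if j in used or p[0] != g[0]:
                continue  # already matched, or on a different page
            score = iou(g[1:], p[1:])
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            hits += 1
    return hits / len(gold) if gold else 0.0


gold = [(0, 0, 0, 10, 10), (1, 0, 0, 10, 10)]
wrong_page = [(0, 0, 0, 10, 10), (0, 0, 0, 10, 10)]  # second box should be on page 1
right_page = [(1, 0, 0, 10, 10), (0, 0, 0, 10, 10)]
print(page_aware_hit_rate(gold, wrong_page), page_aware_hit_rate(gold, right_page))
```

A page-agnostic scorer would give both prediction sets full marks, which is the kind of error a multi-page protocol is meant to expose.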

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HG-Bench introduces a focused new task for hierarchical grounding in multi-page handwritten homework and shows a clear zero-shot gap, but the 500-sample curation lacks the documentation needed to fully trust the numbers.

The main point is that this paper defines a new evaluation setting for localizing both full answers and ordered reasoning steps across sequences of handwritten K-12 homework pages, with a containment constraint between the two levels. Zero-shot frontier models top out at 55.22% on full-answer localization and 48.22% on step-level decomposition, while a fine-tuned 9B model reaches 74.97% and 72.26%.

What is actually new is the exact combination of multi-page input, two-level hierarchical boxes, and the FA/FSm metrics that test whether models recover the spatial structure of student work rather than just reading text. Earlier document grounding papers do not match this setting. The paper does a clean job of releasing the benchmark, the protocol, and a reproducible fine-tuned reference so others can build on it.

The soft spot is the construction of the 500 samples. The abstract says they were curated from a 1.49 million image pool under the containment rule, but supplies no sampling frame, stratification, annotation guidelines, or agreement numbers. Without those, it is hard to know whether the performance gap is driven by model limits or by how the test set was chosen and labeled. On the evidence given, that concern stands.

This is for researchers working on multimodal grounding or educational document AI. A reader who needs a concrete testbed for step-level handwritten localization will get direct value from the protocol and baseline.

I would send it to peer review. The task definition is useful and the empirical comparison is worth having, but the authors should be asked to document the curation and annotation process in detail.

Referee Report

1 major / 0 minor

Summary. The paper introduces HG-Bench, a benchmark of 500 human-annotated multi-page K-12 homework samples curated from a 1,489,278-image pool. It defines a two-level grounding task (question-level answer regions and ordered step-level subregions) under a hierarchical containment constraint, paired with a page-aware evaluation protocol that reports separate scores for complete-answer localization (FA) and step-level decomposition (FSm). Experiments show no zero-shot frontier VLM exceeds 55.22% FA or 48.22% FSm, while a GLM-4.6V 9B model fine-tuned on ~10k in-domain examples reaches 74.97/72.26, identifying step-level handwritten grounding as a capability gap.

Significance. If the annotations prove representative and correctly labeled, the benchmark supplies a reproducible, page-aware testbed that isolates spatial grounding of reasoning steps from mere text parsing. The inclusion of a trained reference model and the explicit FA/FSm split are concrete strengths that could accelerate progress on automated homework assessment.

major comments (1)

[Abstract and HG-Bench construction section] The 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.
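To make the request for a sampling frame concrete, here is a minimal sketch of a proportional stratified draw. The paper's actual selection procedure is not described on this page, so nothing below should be read as the authors' method: the `grade` field, the stratum sizes, and the largest-remainder allocation are hypothetical choices.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, key, k, seed=0):
    """Draw k items, allocating to strata in proportion to their size.

    Quotas are rounded with the largest-remainder method so they sum to k.
    Assumes k <= len(items).
    """
    if not items:
        return []
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    total = len(items)
    exact = {s: k * len(members) / total for s, members in strata.items()}
    quota = {s: int(e) for s, e in exact.items()}
    leftover = k - sum(quota.values())
    # Hand the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, min(quota[s], len(members))))
    return sample


# Hypothetical pool: 60 primary, 30 middle, 10 high-school samples.
pool = (
    [{"id": i, "grade": "primary"} for i in range(60)]
    + [{"id": i, "grade": "middle"} for i in range(60, 90)]
    + [{"id": i, "grade": "high"} for i in range(90, 100)]
)
picked = stratified_sample(pool, key=lambda item: item["grade"], k=10)
print(Counter(item["grade"] for item in picked))
```

Reporting the strata, the quotas, and the seed alongside the benchmark is the kind of documentation that would let a reader judge representativeness.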

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in HG-Bench construction. The single major comment is addressed below; we agree that additional documentation is required to support the benchmark's claims and will incorporate it in revision.

Referee: [Abstract and HG-Bench construction section] The 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.

Authors: We agree that the current manuscript lacks sufficient detail on the curation process. In the revised version we will add a new subsection (likely 3.2 or 3.3) that explicitly describes: (1) the sampling frame and stratification criteria used to select the 500 samples from the 1,489,278-image pool (including grade/subject balance and page-count distribution); (2) the full annotation guidelines provided to annotators; and (3) quality metrics, including inter-annotator agreement on both box boundaries (e.g., IoU thresholds) and the hierarchical containment constraint. These additions will directly address the load-bearing concern for the reported performance gaps.

Revision: yes
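One way the promised box-boundary agreement could be reported is sketched below. This is an assumption-laden illustration rather than the authors' procedure: it compares two annotators' ordered step boxes position by position, counts a pair as agreeing when its IoU reaches `threshold`, and treats a step that only one annotator drew as a disagreement. The function names and the 0.5 default are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def step_agreement(steps_a, steps_b, threshold=0.5):
    """Share of ordered step positions on which two annotators agree.

    steps_a and steps_b are ordered lists of (x1, y1, x2, y2) boxes for the
    same answer on the same page. Extra steps drawn by only one annotator
    lower the score because the denominator is the longer list.
    """
    n = max(len(steps_a), len(steps_b))
    if n == 0:
        return 1.0  # both annotators marked no steps
    hits = sum(1 for a, b in zip(steps_a, steps_b) if iou(a, b) >= threshold)
    return hits / n


annotator_a = [(0, 0, 10, 10), (0, 20, 10, 30)]
annotator_b = [(0, 0, 10, 10), (0, 40, 10, 50)]  # second step placed elsewhere
print(step_agreement(annotator_a, annotator_b))
```

Because steps are compared by position, this score penalizes disagreement about step order as well as about box boundaries, which fits a task where the ordering of reasoning steps is part of the label.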

Circularity Check

0 steps flagged

No circularity: benchmark curation and external model evaluation contain no self-referential derivations or fitted predictions.

The paper introduces HG-Bench by selecting 500 samples from a 1.49M pool and applying human annotation under a hierarchical containment rule, then reports direct zero-shot and fine-tuned performance numbers from external models. No equations, parameter fitting, or predictions appear; the central claim (performance gap) is an empirical measurement on the constructed test set rather than a quantity derived from or equivalent to the curation inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to support any derivation. This is a standard benchmark paper whose claims rest on the external validity of the annotations, not on internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark-construction paper rather than a theoretical derivation. No free parameters are fitted, no mathematical axioms are invoked beyond standard annotation practices, and no new entities are postulated.


[1] [1]

ReferItGame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. InEMNLP, 2014

2014

[2] [2]

Berg, and Tamara L

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. InECCV, 2016

2016

[3] [3]

Gen- eration and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Gen- eration and comprehension of unambiguous object descriptions. InCVPR, 2016

2016

[4] [4]

Plummer, Liwei Wang, Chris M

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30K Entities: Collecting region-to- phrasecorrespondencesforricherimage-to-sentence models. InICCV, 2015

2015

[5] [5]

Visual Genome: Connecting language and vision using crowdsourced dense im- age annotations.IJCV, 123(1):32–73, 2017

RanjayKrishna, YukeZhu, OliverGroth, JustinJohn- son, Kenji Hata, et al. Visual Genome: Connecting language and vision using crowdsourced dense im- age annotations.IJCV, 123(1):32–73, 2017

2017

[6] [6]

Grounded language–image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Li- juan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language–image pre-training. InCVPR, 2022

2022

[7] [7]

Grounding DINO: MarryingDINOwithgroundedpre-trainingforopen- set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: MarryingDINOwithgroundedpre-trainingforopen- set object detection. InECCV, 2024

2024

[8] [8]

Shikra: Unleash- ing multimodal LLM’s referential dialogue magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleash- ing multimodal LLM’s referential dialogue magic. arXiv:2306.15195, 2023

Pith/arXiv arXiv 2023

[9] [9]

Ferret: Refer and ground anything anywhere at any granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. InICLR, 2024

2024

[10] [10]

Minesh Mathew, Dimosthenis Karatzas, and C.V. Jawahar. DocVQA: A dataset for VQA on document images. InWACV, 2021

2021

[11] [11]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of ACL, 2022

2022

[12] [12]

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimos- thenis Karatzas, Ernest Valveny, and C.V. Jawahar. InfographicVQA. InWACV, 2022

2022

[13] [13]

Marti and Horst Bunke

U.-V. Marti and Horst Bunke. The IAM-database: An English sentence database for offline handwriting recognition.IJDAR, 5(1):39–46, 2002

2002

[14] [14]

FUNSD: A dataset for form under- standing in noisy scanned documents

Guillaume Jaume, Hazim Kemal Ekenel, and Jean- Philippe Thiran. FUNSD: A dataset for form under- standing in noisy scanned documents. InICDAR Workshops, 2019

2019

[15] [15]

CASIAonlineandofflineChinesehandwriting databases

Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. CASIAonlineandofflineChinesehandwriting databases. InICDAR, 2013

2013

[16] [16]

ICDAR 2019 CROHME + TFD: Competi- tion on recognition of handwritten mathematical expressions and typeset formula detection

Mahshad Mahdavi, Richard Zanibbi, Harold Mouchere, Christian Viard-Gaudin, and Utpal Garain. ICDAR 2019 CROHME + TFD: Competi- tion on recognition of handwritten mathematical expressions and typeset formula detection. InIC- DAR, 2019

2019

[17] [17]

LayoutLMv3: Pre-training for document AI with unified text and image masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3: Pre-training for document AI with unified text and image masking. InACM Multimedia, 2022

2022

[18] [18]

OCR-free document understand- ing transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Won- seok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understand- ing transformer. InECCV, 2022

2022

[19] [19]

Pix2Struct: Screenshot parsing as pre- training for visual language understanding

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandel- wal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pre- training for visual language understanding. InICML, 2023

2023

[20] [20]

UReader: Universal OCR-free visually-situated language understanding with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Lin, and Fei Huang. UReader: Universal OCR-free visually-situated language understanding with multimodal large language model. In Findings of EMNLP, 2023

2023

[21] [21]

GPT-4V(ision) system card

OpenAI. GPT-4V(ision) system card. Technical report, 2024

2024

[22] [22]

GPT-5.4: System card and deployment notes

OpenAI. GPT-5.4: System card and deployment notes. Technical report, OpenAI, 2026. https://openai.com/index/gpt-5-system-card/

2026

[23] [23]

Gemini: A family of highly capable multimodal models

Google DeepMind. Gemini: A family of highly capable multimodal models. Technical report, 2024

2024

[24] [24]

The Claude 3 model family: Opus, Sonnet, Haiku

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, 2024

2024

[25] [25]

System card: Claude Sonnet 4.6

Anthropic. System card: Claude Sonnet 4.6. Technical report, Anthropic, February 17, 2026. https://www.anthropic.com/claude-sonnet-4-6-system-card

2026

[26] [26]

Seed1.5-VL technical report

ByteDance Seed Team. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025. https://arxiv.org/abs/2505.07062

2025

[27] [27]

Kimi K2.5: Visual agentic intelligence

Kimi Team. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026. https://arxiv.org/abs/2602.02276

2026

[28] [28]

Qwen2-VL: Enhancing vision–language model’s perception of the world at any resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision–language model’s perception of the world at any resolution. Technical report, 2024

2024

[29] [29]

Qwen2.5-VL technical report

Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report. Technical report, 2025

2025

[30] [30]

InternVL: Scaling up vision foundation models and aligning for generic visual–linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual–linguistic tasks. In CVPR, 2024

2024

[31] [31]

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling (InternVL 2.5)

Zhe Chen, Weiyun Wang, Yue Cao, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling (InternVL 2.5). Technical report, 2025

2025

[32] [32]

CogVLM2: Visual language models for image and video understanding

Weihan Wang, Wenyi Hong, Yean Cheng, et al. CogVLM2: Visual language models for image and video understanding. Technical report, 2024

2024

[33] [33]

MiniCPM-V: A GPT-4V level MLLM on your phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. Technical report, 2024

2024

[34] [34]

Florence-2: Advancing a unified representation for a variety of vision tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In CVPR, 2024

2024

[35] [35]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. Technical report, 2024

2024

[36] [36]

DeepSeek-VL2: Mixture-of-experts vision–language models for advanced multimodal understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, et al. DeepSeek-VL2: Mixture-of-experts vision–language models for advanced multimodal understanding. Technical report, 2024

2024

[37] [37]

Phi-3 technical report: A highly capable language model locally on your phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, et al. Phi-3 technical report: A highly capable language model locally on your phone. Technical report, 2024

2024

[38] [38]

GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

GLM-V Team. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025. https://arxiv.org/abs/2507.01006

2025

[39] [39]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

2021

[40] [40]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021

2021

[41] [41]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017

2017

[42] [42]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(120):1–39, 2022

2022

[43] [43]

Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, et al. Mixtral of Experts. Technical report, 2024

2024

[44] [44]

DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models

Damai Dai, Chengqi Deng, Chenggang Zhao, R.X. Xu, Huazuo Gao, Deli Chen, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In ACL, 2024

2024

[45] [45]

MathVista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024

2024

[46] [46]

MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems?

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In ECCV, 2024

2024

[47] [47]

We-Math: Does your large multimodal model achieve human-like mathematical reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? Technical report, 2024

2024

[48] [48]

Measuring multimodal mathematical reasoning with MATH-Vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. In NeurIPS, 2024

2024

[49] [49]

OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In ACL, 2024

2024

[50] [50]

MMBench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024

2024

[51] [51]

SEED-Bench: Benchmarking multimodal LLMs with generative comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. Technical report, 2023

2023

[52] [52]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In CVPR, 2024

2024

[53] [53]

Holistic evaluation of language models

Percy Liang, Rishi Bommasani, Tony Lee, et al. Holistic evaluation of language models. TMLR, 2023

2023

[54] [54]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. TMLR, 2023

2023

[55] [55]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

1960

[56] [56]

Measuring nominal scale agreement among many raters

Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971

1971

A Annotation Guidelines (Excerpts)

This appendix summarizes the core rules used by the annotator pool (Section 4). The full guideline document is released with the benchmark. Skip ...