arxiv: 2606.25491 · v1 · pith:Q5TO3UEWnew · submitted 2026-06-24 · 💻 cs.CV · cs.AI

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Pith reviewed 2026-06-25 21:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords handwritten answer grounding · multi-page document analysis · homework assessment · vision-language models · benchmark · step-level localization · hierarchical bounding boxes · automated grading

The pith

HG-Bench reveals that no zero-shot vision-language model exceeds 55 percent on locating full answers or 48 percent on their reasoning steps in multi-page handwritten homework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HG-Bench to test whether models can locate both complete answer regions and the ordered step-level subregions inside them across sequences of homework page images. It supplies 500 human-annotated samples drawn from a much larger pool, linked by a hierarchical containment rule between question-level and step-level boxes. A page-aware protocol then scores full-answer localization separately from step-level decomposition. Results across frontier models show zero-shot performance capped at 55.22 percent on full answers and 48.22 percent on steps. A reference model fine-tuned on roughly ten thousand in-domain examples reaches 74.97 and 72.26 on the two metrics, establishing step-level grounding as an open capability gap.

Core claim

HG-Bench supplies 500 human-annotated K-12 homework samples with question-level and step-level boxes linked by hierarchical containment. Its evaluation protocol separately measures complete-answer localization (FA) and step-level decomposition (FSm). No zero-shot system exceeds 55.22 percent on FA or 48.22 percent on FSm, while a GLM-4.6V 9B model fine-tuned on approximately ten thousand in-domain examples reaches 74.97/72.26, showing that current systems do not yet capture the spatial structure of student reasoning on the page.
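The size of the gap follows directly from the figures quoted above. The short sketch below only subtracts the reported numbers; the dictionary names are illustrative, and the zero-shot values are the reported ceilings ("no zero-shot system exceeds"), which may come from different models for each metric.

```python
# Reported figures from the abstract; variable names are illustrative only.
best_zero_shot = {"FA": 55.22, "FSm": 48.22}   # ceilings over all zero-shot systems
fine_tuned = {"FA": 74.97, "FSm": 72.26}       # GLM-4.6V 9B reference model

# Absolute gap in percentage points between the fine-tuned reference
# model and the zero-shot ceiling, per metric.
gaps = {
    metric: round(fine_tuned[metric] - best_zero_shot[metric], 2)
    for metric in best_zero_shot
}

for metric, gap in gaps.items():
    print(f"{metric}: +{gap:.2f} points over the zero-shot ceiling")
```

On these figures the fine-tuned model leads by 19.75 points on FA and 24.04 points on FSm, so the shortfall is larger for step-level decomposition than for full-answer localization.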

What carries the argument

HG-Bench benchmark of multi-page homework images annotated with hierarchically contained question-level and step-level bounding boxes, paired with the FA and FSm page-aware evaluation metrics.
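One way to picture the hierarchical containment rule is as a check that every step-level box lies inside some question-level box on the same page. The sketch below is an illustration under stated assumptions, not the paper's implementation: the `Box` type, the `(x1, y1, x2, y2)` pixel convention, the `tol` parameter, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box on one page. Coordinate convention is an assumption."""
    page: int   # index of the page image the box lies on
    x1: float
    y1: float
    x2: float
    y2: float


def contains(outer: Box, inner: Box, tol: float = 0.0) -> bool:
    """True if `inner` is on the same page as `outer` and inside it, within `tol`."""
    return (
        outer.page == inner.page
        and inner.x1 >= outer.x1 - tol
        and inner.y1 >= outer.y1 - tol
        and inner.x2 <= outer.x2 + tol
        and inner.y2 <= outer.y2 + tol
    )


def containment_violations(
    question_boxes: List[Box], step_boxes: List[Box], tol: float = 0.0
) -> List[Box]:
    """Return the step boxes that no question-level box contains."""
    return [
        step
        for step in step_boxes
        if not any(contains(q, step, tol) for q in question_boxes)
    ]


# Hypothetical example: one answer region on page 0, two candidate steps.
question = [Box(page=0, x1=100, y1=200, x2=500, y2=600)]
steps = [
    Box(page=0, x1=120, y1=220, x2=480, y2=300),  # inside the answer region
    Box(page=1, x1=120, y1=220, x2=480, y2=300),  # same coordinates, wrong page
]
violations = containment_violations(question, steps)
```

Taking a list of question-level boxes lets an answer that spans several pages carry one region per page, which is why the page index is part of the comparison.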

If this is right

  • Automated homework assessment requires explicit spatial modeling of step-level structure rather than text recognition alone.
  • Fine-tuning on in-domain handwritten homework data produces large gains on both full-answer and step-level localization.
  • The hierarchical containment constraint supplies a concrete way to evaluate whether models recover the ordered structure of student reasoning.
  • Multi-page page-aware protocols expose limitations that single-image or text-only evaluations miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identified gap suggests current vision-language models may require architectural additions for hierarchical spatial reasoning before they can support reliable automated grading pipelines.
  • HG-Bench could be extended to other subjects or languages to test whether the step-level grounding shortfall generalizes beyond the K-12 homework domain studied here.
  • Models that close the gap on this benchmark may also improve performance on related document tasks that involve locating ordered reasoning within noisy multi-page inputs.

Load-bearing premise

The 500 human-annotated samples drawn from the 1,489,278-image pool, together with the hierarchical containment constraint, provide a representative and correctly labeled test of page-aware answer-region grounding.

What would settle it

An independent re-annotation of the 500 samples showing low inter-annotator agreement on step-level boxes, or a zero-shot model achieving above 70 percent on both FA and FSm on the published test set.

Original abstract

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces HG-Bench, a benchmark of 500 human-annotated multi-page K-12 homework samples curated from a 1,489,278-image pool. It defines a two-level grounding task (question-level answer regions and ordered step-level subregions) under a hierarchical containment constraint, paired with a page-aware evaluation protocol that reports separate scores for complete-answer localization (FA) and step-level decomposition (FSm). Experiments show no zero-shot frontier VLM exceeds 55.22% FA or 48.22% FSm, while a GLM-4.6V 9B model fine-tuned on ~10k in-domain examples reaches 74.97/72.26, identifying step-level handwritten grounding as a capability gap.

Significance. If the annotations prove representative and correctly labeled, the benchmark supplies a reproducible, page-aware testbed that isolates spatial grounding of reasoning steps from mere text parsing. The inclusion of a trained reference model and the explicit FA/FSm split are concrete strengths that could accelerate progress on automated homework assessment.

major comments (1)
  1. [Abstract and HG-Bench construction section] The 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in HG-Bench construction. The single major comment is addressed below; we agree that additional documentation is required to support the benchmark's claims and will incorporate it in revision.

Point-by-point responses
  1. Referee: [Abstract and HG-Bench construction section] The 500 samples are described as 'curated' from the 1,489,278-image pool with a 'hierarchical containment constraint,' yet no sampling frame, stratification criteria, annotation guidelines, or quality metrics (e.g., inter-annotator agreement on box boundaries or containment) are supplied. Because the headline performance gap rests on these 500 annotations being both representative and correctly labeled, the absence of this documentation is load-bearing for the central empirical claim.

    Authors: We agree that the current manuscript lacks sufficient detail on the curation process. In the revised version we will add a new subsection (likely 3.2 or 3.3) that explicitly describes: (1) the sampling frame and stratification criteria used to select the 500 samples from the 1,489,278-image pool (including grade/subject balance and page-count distribution); (2) the full annotation guidelines provided to annotators; and (3) quality metrics, including inter-annotator agreement on both box boundaries (e.g., IoU thresholds) and the hierarchical containment constraint. These additions will directly address the load-bearing concern for the reported performance gaps.

    Revision: yes
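To make the proposed box-boundary agreement metric concrete, the sketch below computes intersection-over-union (IoU) between two annotators' boxes and the share of box pairs that clear a threshold. It is a generic illustration, not the authors' protocol: the `(x1, y1, x2, y2)` convention, the index-aligned pairing of boxes, the 0.5 default threshold, and the function names are all assumptions.

```python
from typing import Sequence, Tuple

# Assumed convention: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
BoxT = Tuple[float, float, float, float]


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreement_rate(
    ann_a: Sequence[BoxT], ann_b: Sequence[BoxT], threshold: float = 0.5
) -> float:
    """Fraction of index-aligned box pairs whose IoU is at least `threshold`."""
    if len(ann_a) != len(ann_b):
        raise ValueError("annotators must label the same number of boxes")
    if not ann_a:
        return 0.0
    hits = sum(iou(a, b) >= threshold for a, b in zip(ann_a, ann_b))
    return hits / len(ann_a)


# Hypothetical example: the annotators agree exactly on the first box and
# differ by a half-width horizontal shift on the second.
annotator_1 = [(0, 0, 10, 10), (0, 0, 10, 10)]
annotator_2 = [(0, 0, 10, 10), (5, 0, 15, 10)]
rate = agreement_rate(annotator_1, annotator_2, threshold=0.5)
```

Index-aligned pairing presumes both annotators produced the same number of boxes in the same order; when box counts differ, a matching step would be needed before any agreement rate is meaningful.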

Circularity Check

0 steps flagged

No circularity: benchmark curation and external model evaluation contain no self-referential derivations or fitted predictions.

full rationale

The paper introduces HG-Bench by selecting 500 samples from a 1.49M pool and applying human annotation under a hierarchical containment rule, then reports direct zero-shot and fine-tuned performance numbers from external models. No equations, parameter fitting, or predictions appear; the central claim (performance gap) is an empirical measurement on the constructed test set rather than a quantity derived from or equivalent to the curation inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to support any derivation. This is a standard benchmark paper whose claims rest on the external validity of the annotations, not on internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark-construction paper rather than a theoretical derivation. No free parameters are fitted, no mathematical axioms are invoked beyond standard annotation practices, and no new entities are postulated.

pith-pipeline@v0.9.1-grok · 5802 in / 1210 out tokens · 19164 ms · 2026-06-25T21:37:57.263061+00:00 · methodology


A Annotation Guidelines (Excerpts)

This appendix summarizes the core rules used by the annotator pool (Section 4). The full guideline document is released with the benchmark. Skip ...