
arxiv: 2605.19307 · v1 · pith:XMBQB5SInew · submitted 2026-05-19 · 💻 cs.CV

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

Pith reviewed 2026-05-20 07:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords Metamorphic Testing · Robustness Evaluation · Visual Question Answering · Multimodal Large Language Models · Failure Mode Analysis · Consistency Checking · Input Transformation

The pith

MetaRA applies metamorphic relations to input variations in order to surface robustness failures in MLLM-based VQA systems that accuracy scores leave undetected.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a testing framework that borrows metamorphic relations from software testing to create controlled changes to both images and questions. It then runs these changed inputs through several multimodal models and checks whether the answers remain consistent. Standard accuracy numbers on fixed datasets miss many of the inconsistencies that appear once the inputs are varied. The experiments show distinct failure patterns such as over-reliance on wording or on superficial image features. If these relations hold as reliable probes, developers gain a repeatable way to diagnose weaknesses before deployment.

Core claim

MetaRA generates new test cases by applying predefined metamorphic relations to original image-question pairs, then measures whether the model produces logically consistent answers across the original and transformed inputs. When applied to multiple MLLM-based VQA systems, the method uncovers sensitivities to linguistic rephrasing, dependence on irrelevant visual details, and breakdowns in cross-modal reasoning that remain invisible under conventional accuracy metrics on static benchmarks.
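The generate-then-compare loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy image representation, the `MetamorphicRelation` class, the two example relations, and `brittle_model` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumptions for illustration: an "image" is a grid of region labels and a
# "model" is any callable mapping (image, question) to an answer string.
Image = List[List[str]]
Model = Callable[[Image, str], str]
Transform = Callable[[Image, str], Tuple[Image, str]]


@dataclass
class MetamorphicRelation:
    name: str
    transform: Transform                    # builds the follow-up input
    consistent: Callable[[str, str], bool]  # expected relation between answers


def same_answer(source: str, follow_up: str) -> bool:
    return source.strip().lower() == follow_up.strip().lower()


def rephrase(image: Image, question: str) -> Tuple[Image, str]:
    # Hypothetical linguistic relation: reword the prompt, keep the intent.
    return image, "Looking at the picture, " + question


def mirror_flip(image: Image, question: str) -> Tuple[Image, str]:
    # Hypothetical visual relation: flip horizontally, keep the question.
    return [list(reversed(row)) for row in image], question


def run_relation(model: Model, mr: MetamorphicRelation,
                 cases: List[Tuple[Image, str]]) -> List[bool]:
    """Return one flag per source case (True means the answers are consistent)."""
    flags = []
    for image, question in cases:
        source_answer = model(image, question)
        follow_image, follow_question = mr.transform(image, question)
        follow_answer = model(follow_image, follow_question)
        flags.append(mr.consistent(source_answer, follow_answer))
    return flags


def brittle_model(image: Image, question: str) -> str:
    # Toy model that only reads the top-left region, so it ignores wording
    # but changes its answer when the image is flipped.
    return image[0][0]


cases = [([["cat", "sky"], ["sky", "sky"]], "Which animal appears in the image?")]
relations = [
    MetamorphicRelation("rephrase", rephrase, same_answer),
    MetamorphicRelation("mirror_flip", mirror_flip, same_answer),
]
report: Dict[str, List[bool]] = {
    mr.name: run_relation(brittle_model, mr, cases) for mr in relations
}
print(report)  # {'rephrase': [True], 'mirror_flip': [False]}
```

The model is treated as a black box throughout, which is what lets the same harness run against different systems without retraining or modifying them.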

What carries the argument

Metamorphic Relations that systematically transform image-question pairs while preserving expected answer consistency, used to generate and evaluate new test cases.

If this is right

  • Models exhibit measurable sensitivity to small linguistic changes even when the underlying visual content is unchanged.
  • Performance drops when questions target relations between objects rather than single-object attributes.
  • Over-reliance on background or low-level visual statistics becomes detectable once inputs are varied.
  • The same framework can be reused across different VQA tasks without retraining or modifying the models under test.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If metamorphic relations can be automatically learned rather than hand-written, the approach could scale to new domains with less manual effort.
  • The diagnostic patterns might be used to guide targeted data augmentation during fine-tuning to reduce the observed weaknesses.
  • Similar relations could be defined for other multimodal tasks such as image captioning or visual reasoning benchmarks.

Load-bearing premise

The selected metamorphic relations must be valid and representative probes that do not themselves create artificial inconsistencies or biases in the observed failure patterns.
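One way to probe this premise is to audit each transformation on inputs whose ground truth can be computed directly, and flag any case where the transformation itself changes the correct answer. The sketch below assumes a toy counting task on label grids; `count_label`, `audit_transform`, and both transformations are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

Image = List[List[str]]


def count_label(image: Image, label: str) -> int:
    # Computable ground truth for "how many <label> are there?" on a toy grid.
    return sum(row.count(label) for row in image)


def mirror_flip(image: Image) -> Image:
    return [list(reversed(row)) for row in image]


def crop_left_half(image: Image) -> Image:
    # Deliberately unsafe for counting questions: it can remove objects.
    return [row[: len(row) // 2] for row in image]


def audit_transform(transform: Callable[[Image], Image],
                    images: List[Image], label: str) -> List[int]:
    """Indices of images whose ground-truth count changes under the transform."""
    return [
        i for i, image in enumerate(images)
        if count_label(transform(image), label) != count_label(image, label)
    ]


images = [
    [["cat", "sky"], ["sky", "cat"]],
    [["sky", "cat"], ["sky", "sky"]],
]
print(audit_transform(mirror_flip, images, "cat"))     # []
print(audit_transform(crop_left_half, images, "cat"))  # [0, 1]
```

A transformation that returns a non-empty list here would confound any failure pattern attributed to the model, which is the risk the premise names.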

What would settle it

Run MetaRA and a standard accuracy evaluation on the same original inputs. If the failures MetaRA detects are no more numerous than those the accuracy evaluation already finds, as would happen for a model whose answers stay fully consistent across all chosen relations, the claim that MetaRA supplies richer diagnostics would be contradicted.
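The comparison can be made concrete by counting cases that an accuracy metric scores as correct but that fail at least one relation. The following is a sketch under stated assumptions: the `CaseResult` record and the four results are invented for illustration and are not data from the paper.

```python
from typing import Dict, List, NamedTuple, Set


class CaseResult(NamedTuple):
    case_id: str
    original_correct: bool       # what an accuracy metric sees
    consistent: Dict[str, bool]  # per-relation outcome on the follow-up input


def accuracy_failures(results: List[CaseResult]) -> Set[str]:
    return {r.case_id for r in results if not r.original_correct}


def metamorphic_failures(results: List[CaseResult]) -> Set[str]:
    return {r.case_id for r in results if not all(r.consistent.values())}


def hidden_failures(results: List[CaseResult]) -> Set[str]:
    # Failures that only the metamorphic check exposes.
    return metamorphic_failures(results) - accuracy_failures(results)


# Made-up results, for illustration only.
results = [
    CaseResult("q1", True, {"rephrase": True, "flip": True}),
    CaseResult("q2", True, {"rephrase": False, "flip": True}),
    CaseResult("q3", False, {"rephrase": True, "flip": True}),
    CaseResult("q4", False, {"rephrase": True, "flip": False}),
]
print(sorted(hidden_failures(results)))  # ['q2']
```

An empty `hidden_failures` set across a whole benchmark would correspond to the outcome that contradicts the claim.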

Figures

Figures reproduced from arXiv: 2605.19307 by Chia-Wen Lin, Ling Zhou, Quanxing Xu, Rubing Huang, Xian Zhong, Xiaohua Huang, Yuhao Tian.

Figure 1: Comparison of robustness evaluation paradigms for VQA. Existing approaches focus on isolated aspects, including (a) language-bias evaluation and (b) image-understanding evaluation, whereas the proposed method (c) performs a systematic assessment by constructing multiple Metamorphic Relations (MRs) to probe diverse model behaviors.
Vision-language tasks serve as representative benchmarks for evaluating mod…
Figure 2: Overview of the MetaRA framework. The framework consists of four stages: Input Understanding, Case Generation, Inference Testing, and Failure Detection, which respectively estimate input difficulty, construct transformation ensembles, evaluate models via systematic testing, and analyze potential defects.
… ground-truth outputs are difficult to define across heterogeneous inputs such as text and images. Inst…
Figure 5: FR comparison and ranking consistency between MetaVQA (a) and MetaRA (b) on E-VQA and TEXTVQA. Note that lower FR indicates stronger robustness. The dashed lines show the perfectly consistent ranking trend.
Even recent models, such as MM1.5 [43], maintain relatively high FR. This behavior is mainly attributed to the increased difficulty of OCR-VQA, in which the mirror-flip operation in MR2 significantly di…
Figure 4 [image: figures/full_fig_p007_4.png; no caption text extracted]
Figure 7: FR distribution comparison between KBVQA and OCR-VQA tasks. OCR-VQA exhibits a wider distribution range and higher overall FR, indicating increased difficulty in text-intensive multimodal reasoning.
TABLE VI: Average FR across four MRs for KBVQA and OCR-VQA. The reported values are obtained by averaging FR of each MR over all corresponding datasets, highlighting differences in robustness across testing tar…
Figure 8: MR-specific robustness of models on KBVQA and OCR-VQA tasks. Note that a darker color indicates a higher FR, implying worse robustness.
[Text spilled from the example figure: source questions "What year was the vehicle in the picture invented?" and "What is the title of this book?", with MR follow-ups "In which year was the vehicle depicted in the picture invented?" and "What’s the title written on this book?"; answer labels "A: Golden retriever / GT: Golden retriever" and "A: TNOMREV LUFITUAEB / GT: Vermont Beautif…"]
Figure 9: Examples of metamorphic mutations in KBVQA and OCR-VQA QI pairs. The comparison illustrates how different mutation strategies affect visual-textual consistency across the two task types.
… KBVQA (9.1%), demonstrating that models’ robustness varies significantly with task type, with OCR-VQA presenting a more challenging robustness evaluation scenario. The models’ robustness across both tasks and all four MRs …
Original abstract

Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes MetaRA, a framework inspired by metamorphic testing, to assess robustness in MLLM-based VQA systems. It defines metamorphic relations (MRs) to generate controlled variations of image-question inputs, applies them to multiple models across tasks, and reports that this exposes nuanced failure patterns (e.g., sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and multimodal reasoning weaknesses) that standard accuracy metrics on static datasets miss. The central claim is that MetaRA yields richer diagnostic insights and is a scalable, model-agnostic approach for trustworthy multimodal AI.

Significance. If the experimental demonstration holds, the work supplies a practical, relation-driven testing method that addresses a recognized gap in VQA evaluation—namely, the inability of accuracy-only benchmarks to reveal consistency or generalization failures. The model-agnostic framing and focus on controlled input variations are positive features that could be adopted more broadly if the chosen MRs prove reliable.

major comments (2)
  1. [§4] §4 (Experimental Results): The manuscript asserts that MetaRA reveals failure modes hidden under standard benchmarks, yet the reported results lack quantitative comparison (e.g., number or rate of additional failures detected versus accuracy baselines) or statistical validation of the observed patterns. This weakens the load-bearing claim that the framework provides 'richer diagnostic insights.'
  2. [§3.2] §3.2 (Metamorphic Relations): The validity of the selected MRs as unbiased probes is assumed rather than demonstrated; without explicit checks that each transformation preserves the ground-truth answer or does not introduce its own visual/linguistic artifacts, the reported failure patterns risk being confounded by the MRs themselves.
minor comments (3)
  1. [Figure 2, Table 1] Figure 2 and Table 1: axis labels and legend entries are too small for readability; increase font size and add a caption that explicitly links each MR to the failure mode it is intended to expose.
  2. [§3.3] Notation: the definition of 'robustness score' in §3.3 is introduced without a formal equation; adding Eq. (X) would clarify how pass/fail counts are aggregated across MRs.
  3. [§2.2] Related work: the discussion of prior metamorphic testing applications in NLP (§2.2) omits recent VQA-specific robustness papers (e.g., on adversarial perturbations); a brief comparison would strengthen positioning.
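
The formal definition requested in minor comment 2 might take a shape like the following. This is a sketch, not the paper's equation: the symbols $f$ (model), $D$ (set of source image-question pairs), $T_r$ (transformation of relation $r$), $R$ (set of relations), and the answer-equivalence check $\equiv$ are assumptions, and FR is read here as a failure rate, which fits the figure captions' note that lower FR indicates stronger robustness.

```latex
\[
\mathrm{FR}_r(f) \;=\; \frac{1}{|D|} \sum_{(v,q) \in D}
  \mathbb{1}\!\left[\, f\bigl(T_r(v,q)\bigr) \not\equiv f(v,q) \,\right],
\qquad
\overline{\mathrm{FR}}(f) \;=\; \frac{1}{|R|} \sum_{r \in R} \mathrm{FR}_r(f)
\]
```

Under this reading, each relation contributes the fraction of source cases whose follow-up answer violates it, and the aggregate is the unweighted mean over relations.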

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment point by point below, with plans to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The manuscript asserts that MetaRA reveals failure modes hidden under standard benchmarks, yet the reported results lack quantitative comparison (e.g., number or rate of additional failures detected versus accuracy baselines) or statistical validation of the observed patterns. This weakens the load-bearing claim that the framework provides 'richer diagnostic insights.'

    Authors: We agree that the current presentation would benefit from explicit quantitative comparisons and statistical support. In the revised version, we will add tables and figures that report the number and rate of additional failures uncovered by MetaRA relative to standard accuracy baselines across the evaluated models and tasks. We will also include statistical validation (e.g., paired significance tests or confidence intervals) on the observed differences in failure patterns to substantiate the claim of richer diagnostic insights.
    revision: yes

  2. Referee: [§3.2] §3.2 (Metamorphic Relations): The validity of the selected MRs as unbiased probes is assumed rather than demonstrated; without explicit checks that each transformation preserves the ground-truth answer or does not introduce its own visual/linguistic artifacts, the reported failure patterns risk being confounded by the MRs themselves.

    Authors: We acknowledge the value of explicit validation. Although the MRs were constructed following metamorphic testing conventions to preserve semantic equivalence (e.g., linguistic paraphrases that retain the original question intent and visual perturbations that do not alter the answer-relevant content), we agree that documenting this more rigorously would address potential concerns. In the revision, we will add a dedicated subsection with validation procedures, including manual inspection of a sample of transformed instances and automated checks confirming that ground-truth answers remain unchanged and no confounding artifacts are introduced.
    revision: yes
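
As one possible form of the paired significance test mentioned in the first response, an exact McNemar test compares two binary outcomes recorded on the same cases, for example "failed under accuracy evaluation" versus "failed under MetaRA". The sketch below uses only the standard library; the function name and the ten outcomes are made up for illustration and do not come from the paper.

```python
from math import comb
from typing import List


def mcnemar_exact(a_fail: List[bool], b_fail: List[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(a_fail) != len(b_fail):
        raise ValueError("outcome lists must be paired and equally long")
    # Discordant pairs: cases where exactly one of the two evaluations fails.
    only_a = sum(1 for a, b in zip(a_fail, b_fail) if a and not b)
    only_b = sum(1 for a, b in zip(a_fail, b_fail) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up paired outcomes on ten cases (True means a failure was recorded).
accuracy_fail = [False, False, False, False, False, False, True, True, False, False]
metara_fail = [True, True, True, True, True, True, True, False, False, False]
p_value = mcnemar_exact(accuracy_fail, metara_fail)
print(p_value)  # 0.125
```

Only the discordant cases carry information in this test, so a large benchmark with few disagreements between the two evaluations would still yield a weak result.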

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes MetaRA as an independent metamorphic testing framework that applies chosen relations to generate controlled input variations and then reports observed failure patterns in MLLM-based VQA models. The central claims rest on experimental outcomes rather than any equations or derivations that reduce reported insights to quantities fitted from the same evaluation data. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the framework definition and its diagnostic results remain separable and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that metamorphic relations can be defined for VQA inputs that preserve semantic intent while exposing robustness gaps. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: Metamorphic relations can be formulated for multimodal VQA inputs such that expected output behavior is well-defined and independent of any particular model.
    Invoked when the framework generates controlled variations and interprets deviations as robustness failures.

pith-pipeline@v0.9.0 · 5749 in / 1195 out tokens · 28277 ms · 2026-05-20T07:05:05.612203+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

  1. [1]

    StoryLLaVA: Enhancing visual storytelling with multi-modal large language models,

    L. Yang, Z. Xiao, W. Huang, and X. Zhong, “StoryLLaVA: Enhancing visual storytelling with multi-modal large language models,” in Proc. Int. Conf. Comput. Linguistics, pp. 3936–3951, 2025

  2. [2]

    Refined semantic enhancement towards frequency diffusion for video captioning,

    X. Zhong, Z. Li, S. Chen, K. Jiang, C. Chen, and M. Ye, “Refined semantic enhancement towards frequency diffusion for video captioning,” in Proc. AAAI Conf. Artif. Intell., pp. 3724–3732, 2023

  3. [3]

    Action-aware linguistic skeleton optimization network for non-autoregressive video captioning,

    S. Chen, X. Zhong, Y. Zhang, L. Zhu, P. Li, X. Yang, and B. Sheng, “Action-aware linguistic skeleton optimization network for non-autoregressive video captioning,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 20, no. 10, pp. 326:1–326:24, 2024

  4. [4]

    VQA: Visual question answering,

    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 2425–2433, 2015

  5. [5]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering,

    Y. Goyal, T. Khot, A. Agrawal, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 6904–6913, 2017

  6. [6]

    Robust visual question answering: Datasets, methods, and future challenges,

    J. Ma, P. Wang, D. Kong, Z. Wang, J. Liu, H. Pei, and J. Zhao, “Robust visual question answering: Datasets, methods, and future challenges,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 8, pp. 5575–5594, 2024

  7. [7]

    Metamorphic testing: a new approach for generating next test cases,

    T. Y. Chen, S. Cheung, and S. Yiu, “Metamorphic testing: A new approach for generating next test cases,” arXiv preprint arXiv:2002.12543, 2002

  8. [8]

    KVQA: Knowledge-aware visual question answering,

    S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar, “KVQA: Knowledge-aware visual question answering,” in Proc. AAAI Conf. Artif. Intell., pp. 8876–8884, 2019

  9. [9]

    OCR-VQA: Visual question answering by reading text in images,

    A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty, “OCR-VQA: Visual question answering by reading text in images,” in Proc. Int. Conf. Document Anal. Recognit., pp. 947–952, 2019

  10. [10]

    Metamorphic testing: A review of challenges and opportunities,

    T. Y. Chen, F. Kuo, H. Liu, P. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, “Metamorphic testing: A review of challenges and opportunities,” ACM Comput. Surv., vol. 51, no. 1, pp. 4:1–4:27, 2018

  11. [11]

    Perception matters: Detecting perception failures of VQA models using metamorphic testing,

    Y. Yuan, S. Wang, M. Jiang, and T. Y. Chen, “Perception matters: Detecting perception failures of VQA models using metamorphic testing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 16908–16917, 2021

  12. [12]

    Metamorphic testing of image captioning systems via image-level reduction,

    X. Xie, X. Li, and S. Chen, “Metamorphic testing of image captioning systems via image-level reduction,” IEEE Trans. Softw. Eng., vol. 50, no. 11, pp. 2962–2982, 2024

  13. [13]

    How multi-modal LLMs reshape visual deep learning testing? A comprehensive study through the lens of image mutation,

    L. Wang, Y. Yuan, A. Sun, Z. Li, P. Ma, D. Wu, and S. Wang, “How multi-modal LLMs reshape visual deep learning testing? A comprehensive study through the lens of image mutation,” arXiv preprint arXiv:2404.13945, 2024

  14. [14]

    CLIP in mirror: Disentangling text from visual images through reflection,

    T. Wang, Y. Yang, L. Yang, S. Lin, J. Zhang, G. Guo, and B. Zhang, “CLIP in mirror: Disentangling text from visual images through reflection,” in Adv. Neural Inf. Process. Syst., 2024

  15. [15]

    Order matters: Exploring order sensitivity in multimodal large language models,

    Z. Tan, X. Chu, W. Li, and T. Mo, “Order matters: Exploring order sensitivity in multimodal large language models,” arXiv preprint arXiv:2410.16983, 2024

  16. [16]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 26286–26296, 2024

  17. [17]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024

  18. [18]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025

  19. [19]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, “LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models,” arXiv preprint arXiv:2407.07895, 2024

  20. [20]

    Cross-modal retrieval for knowledge-based visual question answering,

    P. Lerner, O. Ferret, and C. Guinaudeau, “Cross-modal retrieval for knowledge-based visual question answering,” in Proc. Eur. Conf. Inf. Retrieval, pp. 421–438, 2024

  21. [21]

    RoRA-VLM: Robust retrieval-augmented vision language models,

    J. Qi, Z. Xu, R. Shao, Y. Chen, J. Di, Y. Cheng, Q. Wang, and L. Huang, “RoRA-VLM: Robust retrieval-augmented vision language models,” arXiv preprint arXiv:2410.08876, 2024

  22. [22]

    EchoSight: Advancing visual-language models with wiki knowledge,

    Y. Yan and W. Xie, “EchoSight: Advancing visual-language models with wiki knowledge,” in Findings Conf. Empirical Methods Natural Lang. Process., pp. 1538–1551, 2024

  23. [23]

    Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs,

    D. Caffagni, F. Cocchi, N. Moratelli, S. Sarto, M. Cornia, L. Baraldi, and R. Cucchiara, “Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, pp. 1818–1826, 2024

  24. [24]

    Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering,

    F. Cocchi, N. Moratelli, M. Cornia, L. Baraldi, and R. Cucchiara, “Augmenting multimodal LLMs with self-reflective tokens for knowledge-based visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 9199–9209, 2025

  25. [25]

    Fine-grained retrieval-augmented generation for visual question answering,

    Z. Zhang, Y. Wu, Y. Luo, and N. Tang, “Fine-grained retrieval-augmented generation for visual question answering,” arXiv preprint arXiv:2502.20964, 2025

  26. [26]

    MMKB-RAG: A multi-modal knowledge-based retrieval-augmented generation framework,

    Z. Ling, Z. Guo, Y. Huang, Y. An, S. Xiao, J. Lan, X. Zhu, and B. Zheng, “MMKB-RAG: A multi-modal knowledge-based retrieval-augmented generation framework,” arXiv preprint arXiv:2504.10074, 2025

  27. [27]

    Knowledge-based visual question answering with multimodal processing, retrieval, and filtering,

    Y. Hong, J. Gu, Q. Yang, L. Fan, Y. Wu, Y. Wang, K. Ding, S. Xiang, and J. Ye, “Knowledge-based visual question answering with multimodal processing, retrieval, and filtering,” arXiv preprint arXiv:2510.14605, 2025

  28. [28]

    Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories,

    T. Mensink, J. Uijlings, L. Castrejon, A. Goel, F. Cadar, H. Zhou, F. Sha, A. Araujo, and V. Ferrari, “Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 3090–3101, 2023

  29. [29]

    Can pre-trained vision and language models answer visual information-seeking questions?,

    Y. Chen, H. Hu, Y. Luan, H. Sun, S. Changpinyo, A. Ritter, and M. Chang, “Can pre-trained vision and language models answer visual information-seeking questions?,” in Proc. Conf. Empirical Methods Natural Lang. Process., pp. 14948–14968, 2023

  30. [30]

    DocVQA: A dataset for VQA on document images,

    M. Mathew, D. Karatzas, and C. V. Jawahar, “DocVQA: A dataset for VQA on document images,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., pp. 2199–2208, 2021

  31. [31]

    InfographicVQA,

    M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar, “InfographicVQA,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., pp. 2582–2591, 2022

  32. [32]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning,

    A. Masry, D. X. Long, J. Q. Tan, S. R. Joty, and E. Hoque, “ChartQA: A benchmark for question answering about charts with visual and logical reasoning,” in Findings Annu. Meeting Assoc. Comput. Linguistics, pp. 2263–2279, 2022

  33. [33]

    Towards VQA models that can read,

    A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards VQA models that can read,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 8317–8326, 2019

  34. [34]

    UReader: Universal OCR-free visually situated language understanding with multimodal large language model,

    J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhang, Q. Jin, L. He, X. Lin, and F. Huang, “UReader: Universal OCR-free visually situated language understanding with multimodal large language model,” in Findings Conf. Empirical Methods Natural Lang. Process., pp. 2841–2858, 2023

  35. [35]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, “InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” arXiv preprint arXiv:2312.14238, 2023

  36. [36]

    mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding,

    A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, J. Zhang, Q. Jin, F. Huang, and J. Zhou, “mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding,” in Findings Conf. Empirical Methods Natural Lang. Process., pp. 3096–3120, 2024

  37. [37]

    CogAgent: A visual language model for GUI agents,

    W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, and J. Tang, “CogAgent: A visual language model for GUI agents,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 14281–14290, 2024

  38. [38]

    Monkey: Image resolution and text label are important things for large multi-modal models,

    Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai, “Monkey: Image resolution and text label are important things for large multi-modal models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 26753–26763, 2024

  39. [39]

    TextHawk: Exploring efficient fine-grained perception of multimodal large language models

    Y. Yu, M. Liao, J. Wu, Y. Liao, X. Zheng, and W. Zeng, “TextHawk: Exploring efficient fine-grained perception of multimodal large language models,” arXiv preprint arXiv:2404.09204, 2024

  40. [40]

    TextHawk2: A large vision-language model excels in bilingual OCR and grounding with 16× fewer tokens,

    Y. Yu, M. Liao, J. Zhang, and J. Wu, “TextHawk2: A large vision-language model excels in bilingual OCR and grounding with 16× fewer tokens,” arXiv preprint arXiv:2410.05261, 2024

  41. [41]

    HRVDA: High-resolution visual document assistant,

    C. Liu, K. Yin, H. Cao, X. Jiang, X. Li, Y. Liu, D. Jiang, X. Sun, and L. Xu, “HRVDA: High-resolution visual document assistant,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 15534–15545, 2024

  42. [42]

    Vary: Scaling up the vision vocabulary for large vision-language models,

    H. Wei, L. Kong, J. Chen, L. Zhao, Z. Ge, J. Yang, J. Sun, C. Han, and X. Zhang, “Vary: Scaling up the vision vocabulary for large vision-language models,” in Proc. Eur. Conf. Comput. Vis., pp. 408–424, 2024

  43. [43]

    MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning,

    H. Zhang, M. Gao, Z. Gan, P. Dufter, N. Wenzel, F. Huang, D. Shah, X. Du, B. Zhang, Y. Li, S. Dodge, K. You, Z. Yang, A. Timofeev, M. Xu, H. Chen, J. Fauconnier, Z. Lai, H. You, Z. Wang, A. Dehghan, P. Grasch, and Y. Yang, “MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning,” in Proc. Int. Conf. Learn. Represent., 2025

  44. [44]

    Marten: Visual question answering with mask generation for multi-modal document understanding,

    Z. Wang, T. Guan, P. Fu, C. Duan, Q. Jiang, Z. Guo, S. Guo, J. Luo, W. Shen, and X. Yang, “Marten: Visual question answering with mask generation for multi-modal document understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 14460–14471, 2025.

    IEEE TRANSACTIONS ON MULTIMEDIA, 2026. SUPPLEMENTARY MATERIALS FOR MetaRA: Metamorphic Robus...