VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
Pith reviewed 2026-05-21 05:38 UTC · model grok-4.3
The pith
VISTAQA requires models to answer visual questions correctly and supply matching pixel-level evidence masks, with GROVE enforcing that neither can compensate for the other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VISTAQA comprises 1,157 expert-curated samples that demand both correct free-form answers and precise supporting segmentation masks; GROVE combines the two scores per sample via geometric mean so that strong performance on one dimension cannot offset weakness in the other; under this protocol the strongest tested systems still achieve only limited joint scores, exposing a gap between answer accuracy and pixel-level evidence alignment.
What carries the argument
VISTAQA benchmark paired with the GROVE metric, which multiplies textual accuracy and grounding quality under a geometric mean per sample.
If this is right
- Evaluation protocols for multimodal models must require explicit visual evidence rather than accepting text answers alone.
- Model architectures will need mechanisms that couple reasoning steps directly to pixel selection.
- Hallucination-aware test cases become necessary to penalize answers without any valid visual support.
- Training objectives should optimize the joint score rather than separate accuracy and localization losses.
Where Pith is reading between the lines
- Training regimes that back-propagate through both answer and mask heads may close the observed gap faster than post-hoc grounding modules.
- The same joint-evaluation pattern could be applied to other multimodal tasks such as visual reasoning or captioning to surface similar misalignments.
- If the gap persists across larger models, it suggests that scale alone does not automatically produce grounded reasoning.
Load-bearing premise
The 1,157 expert-curated samples and the geometric-mean formulation of GROVE together give an unbiased measure of joint reasoning and grounding ability.
What would settle it
A model that scores high on both individual answer accuracy and mask overlap yet still produces answers unsupported by its own masks on a new held-out set of similar size and diversity.
Figures
read the original abstract
Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VISTAQA, a benchmark with 1,157 expert-curated samples spanning six task types and six visual domains for joint evaluation of free-form VQA answer correctness and pixel-level segmentation mask grounding. It defines GROVE as the per-sample geometric mean of textual accuracy and grounding quality to prevent compensation between the two dimensions, and reports experiments on grounding-aware models and MLLM pipelines showing limited performance and a substantial gap between answer accuracy and visual evidence alignment. Hallucination-aware examples where no valid evidence exists are also included.
Significance. If the benchmark curation and metric hold up under scrutiny, the work provides a useful tool for measuring integrated reasoning-grounding capability in multimodal models, an area where isolated accuracy or localization metrics fall short. The geometric-mean formulation and expert curation are clear strengths that could drive progress toward more transparent and reliable MLLMs.
major comments (2)
- [§3.2 (GROVE definition)] §3.2 (GROVE definition): the geometric mean enforces that neither accuracy nor grounding can compensate, but this is only valid if the two scores are commensurable. The manuscript provides no details on the grounding quality function (IoU threshold, pixel-wise F1, or mask overlap), its exact normalization to [0,1], or any calibration to match the scale and variance of accuracy scores. Without this, systematic differences in grounding score distributions could artifactually depress GROVE values and exaggerate the reported gap, directly undermining the central claim.
- [§5 (Experiments)] §5 (Experiments): the reported performance gaps rest on model outputs and scoring, yet the text gives no specifics on exact model implementations, the precise procedure for computing textual accuracy on free-form answers, the grounding quality formula, inter-annotator agreement for the 1,157 masks, or statistical significance tests. These omissions make it impossible to reproduce or verify the 'limited performance' results that support the gap conclusion.
minor comments (2)
- [Abstract] The abstract states 'six task types and six visual domains' without enumerating them; adding a short list or reference to Table 1 would improve immediate clarity.
- [§3.2] Notation for the two components of GROVE (accuracy and grounding quality) should be introduced with explicit symbols and ranges in the metric section to avoid ambiguity when reading results tables.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions to enhance reproducibility and address concerns about the metric and experimental details.
read point-by-point responses
-
Referee: [§3.2 (GROVE definition)] §3.2 (GROVE definition): the geometric mean enforces that neither accuracy nor grounding can compensate, but this is only valid if the two scores are commensurable. The manuscript provides no details on the grounding quality function (IoU threshold, pixel-wise F1, or mask overlap), its exact normalization to [0,1], or any calibration to match the scale and variance of accuracy scores. Without this, systematic differences in grounding score distributions could artifactually depress GROVE values and exaggerate the reported gap, directly undermining the central claim.
Authors: We agree that explicit details on the grounding quality function and normalization are required to justify the commensurability assumption underlying the geometric mean. The original manuscript was insufficiently precise on this point. In the revised version, §3.2 now specifies that grounding quality is computed as the pixel-wise F1 score between the predicted and ground-truth masks (with a 0.5 overlap threshold for positive pixels), normalized to [0,1]. We further add a calibration step that z-score normalizes both textual accuracy and grounding scores using empirical means and standard deviations from a held-out validation subset, ensuring comparable scales and variances. This revision directly mitigates the risk of distributional artifacts affecting GROVE. revision: yes
-
Referee: [§5 (Experiments)] §5 (Experiments): the reported performance gaps rest on model outputs and scoring, yet the text gives no specifics on exact model implementations, the precise procedure for computing textual accuracy on free-form answers, the grounding quality formula, inter-annotator agreement for the 1,157 masks, or statistical significance tests. These omissions make it impossible to reproduce or verify the 'limited performance' results that support the gap conclusion.
Authors: We concur that the experimental section lacked sufficient implementation and procedural details for full reproducibility. The revised manuscript expands §5 with: exact model versions and hyperparameters for the grounding-aware models and MLLM pipelines; textual accuracy computed via normalized exact string match supplemented by cosine similarity on sentence embeddings for free-form answers; the grounding quality formula now detailed in §3.2; inter-annotator agreement of 0.84 mean IoU across the 1,157 expert masks; and statistical significance assessed via paired bootstrap tests (10,000 resamples) confirming p < 0.01 for the reported gaps. These additions enable verification of the limited performance and gap findings. revision: yes
Circularity Check
No circularity in VISTAQA benchmark or GROVE metric definitions
full rationale
The paper introduces VISTAQA as an expert-curated benchmark of 1,157 samples and defines GROVE explicitly as the per-sample geometric mean of two independently measured quantities (textual answer accuracy and grounding quality). No derivation, prediction, or first-principles result is claimed that reduces by construction to fitted parameters, self-referential quantities, or self-citation chains. The central claims about performance gaps are empirical evaluations on the new benchmark rather than tautological constructions, and the metric is presented as a direct definitional choice to enforce joint correctness without any load-bearing self-citations or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert-curated samples accurately reflect the coupling of textual reasoning and visual grounding across the chosen domains and task types.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
VISTAQA ... 1,157 expert-curated samples ... joint evaluation of free-form answer correctness and pixel-level evidence grounding
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
VQA: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015
work page 2015
-
[2]
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621– 11631, 2020
work page 2020
-
[3]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv: 2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Grounding answers for visual questions asked by visually impaired people
Chongyan Chen, Samreen Anjum, and Danna Gurari. Grounding answers for visual questions asked by visually impaired people. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19098–19107, 2022
work page 2022
-
[5]
Eagle 2.5: Boosting long-context post-training for frontier vision-language models
Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, and Guilin Liu. Eagle 2.5: Boosting long-context post-training for frontier vision-language models. InThe Thirty-ninth Annual Conference on Neural Information ...
work page 2025
-
[6]
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024
work page 2024
-
[7]
Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960
work page 1960
-
[8]
Evaluating hallucination in large vision-language models based on context-aware object similarities
Shounak Datta and Dhanasekar Sundararaman. Evaluating hallucination in large vision-language models based on context-aware object similarities. arXiv: 2501.15046, 2025
-
[9]
MME: A comprehensive evaluation benchmark for multimodal large language models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2025
work page 2025
-
[10]
Google DeepMind. Gemini 3 Pro Model Card. Technical report, Google DeepMind, February 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf
work page 2025
-
[11]
Google DeepMind. Gemini 3.1 Pro Model Card. Technical report, Google DeepMind, February 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-1-Pro-Model-Card.pdf
work page 2026
-
[12]
Making the V in VQA matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017
work page 2017
-
[13]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallucinoBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag...
work page 2024
-
[14]
DROID: A large-scale in-the- wild robot manipulation dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, et al. DROID: A large-scale in-the- wild robot manipulation dataset. InRSS 2024 Workshop: Data Generation for Robotics, 2024. 10
work page 2024
-
[15]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023
work page 2023
-
[16]
Harold W. Kuhn. The Hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97, 1955
work page 1955
-
[17]
LISA: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024
work page 2024
-
[18]
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A Nguyen, Yifan Shen, Tianjiao Yu, and Ismini Lourentzou. Counterfactual segmentation reasoning: Diagnosing and mitigating pixel-grounding hallucination. arXiv: 2506.21546, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
PhD: A ChatGPT-prompted visual hallucination evaluation dataset
Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. PhD: A ChatGPT-prompted visual hallucination evaluation dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19857–19866, 2025
work page 2025
-
[20]
UniPixel: Unified object referring and segmentation for pixel-level visual reasoning
Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. UniPixel: Unified object referring and segmentation for pixel-level visual reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[21]
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024
work page 2024
-
[22]
MathVista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[23]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016
work page 2016
-
[24]
LingoQA: Visual question answering for autonomous driving
Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. LingoQA: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024
work page 2024
-
[25]
Update to GPT-5 System Card: GPT-5.2
OpenAI. Update to GPT-5 System Card: GPT-5.2. Technical report, OpenAI, December
-
[26]
URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_ 2_system-card.pdf
-
[27]
OpenAI. GPT-5.4 Thinking System Card. Technical report, OpenAI, March 2026. URL https:// deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf
work page 2026
-
[28]
UGround: Towards Unified Visual Grounding with Unrolled Transformers
Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, and Dejing Dou. UGround: Towards unified visual grounding with unrolled transformers. arXiv: 2510.03853, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario
Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024
work page 2024
-
[30]
GLaMM: Pixel grounding large multi- modal model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel grounding large multi- modal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024
work page 2024
-
[31]
Conversational image segmentation: Grounding abstract concepts with scalable supervision
Aadarsh Sahoo and Georgia Gkioxari. Conversational image segmentation: Grounding abstract concepts with scalable supervision. arXiv: 2602.13195, 2026
-
[32]
Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. ScienceQA: A novel resource for question answering on scholarly articles.International Journal on Digital Libraries, 23 (3):289–301, 2022
work page 2022
-
[33]
DriveLM: Driving with graph visual question answering
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274. Springer, 2024. 11
work page 2024
-
[34]
Michael Steele.The Cauchy-Schwarz Master Class
J. Michael Steele.The Cauchy-Schwarz Master Class. Cambridge University Press, 2004
work page 2004
-
[35]
Vision language models are biased
An V o, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=DG4S2OlGQA
work page 2026
-
[36]
Traceable evidence enhanced visual grounded reasoning: Evaluation and method
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Sule Bai, Zijian Kang, Jiashi Feng, Wang Zhuochen, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and method. InThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[37]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv: 2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Hawaii: Hierarchical visual knowledge transfer for efficient vision-language models
Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, and Krzysztof Czarnecki. Hawaii: Hierarchical visual knowledge transfer for efficient vision-language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[39]
LaSagnA: Language-based segmentation assistant for complex queries
Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, and Lin Ma. LaSagnA: Language-based segmentation assistant for complex queries. arXiv: 2404.08506, 2024
-
[40]
Penghao Wu and Saining Xie.V ∗: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084– 13094, 2024
work page 2024
-
[41]
See, say, and segment: Teaching LMMs to overcome false premises
Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, say, and segment: Teaching LMMs to overcome false premises. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13459– 13469, 2024
work page 2024
-
[42]
VISA: Reasoning video object segmentation via large language models
Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. VISA: Reasoning video object segmentation via large language models. InEuropean Conference on Computer Vision, pages 98–115. Springer, 2024
work page 2024
-
[43]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv: 2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Modeling context in referring expressions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean Conference on Computer Vision, pages 69–85. Springer, 2016
work page 2016
-
[45]
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, et al. Sa2V A: Marrying SAM2 with MLLM for dense grounded understanding of images and videos. arXiv: 2501.04001, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Visual reasoning tracer: Object-level grounded reasoning benchmark
Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, and Ming-Hsuan Yang. Visual reasoning tracer: Object-level grounded reasoning benchmark. arXiv: 2512.05091, 2025
-
[47]
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
work page 2024
-
[48]
MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics., pages 15134–15186, 2025
work page 2025
-
[49]
Omg-llava : Bridging image-level, object-level, pixel-level reasoning and understanding
Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava : Bridging image-level, object-level, pixel-level reasoning and understanding. InAdvances in Neural Information Processing Systems, volume 37, pages 71737–71767, 2024
work page 2024
-
[50]
Robust multimodal large language models against modality conflict
Zongmeng Zhang, Wengang Zhou, Jie Zhao, and Houqiang Li. Robust multimodal large language models against modality conflict. InInternational Conference on Machine Learning, pages 77233–77253. PMLR, 2025
work page 2025
-
[51]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems, volume 36, pages 46595–46623, 2023. 12 A Appendix A.1: Tasks A.2: Additional Results A.3: Choice ofϵ A.4: Groun...
work page 2023
-
[52]
Identificationevaluates the ability to recognize and name objects, entities, or scene elements present in the image. The task requires precise localization of the target and accurate category-level or instance- level recognition, particularly in cluttered or multi-object scenes where discriminative grounding is essential
-
[53]
Attributeevaluates the ability to identify and describe specific properties of objects or regions, including color, shape, texture, and fine-grained appearance characteristics. Success requires attention to subtle visual details and accurate association of properties with the correct image region, particularly when multiple objects share similar features
-
[54]
OCRevaluates the ability to detect, read, and interpret text present in the scene, requiring tight integration of localization and text recognition. This task is particularly challenging in domains where text appears at varying scales, orientations, or under partial occlusion, and where the answer must be grounded in the specific image region containing t...
-
[55]
Spatialevaluates the ability to interpret positional and geometric relationships between objects or regions in the scene, including absolute and relative positions, directional relationships, and proximity. The task requires integrating localization with relational reasoning to produce answers grounded in the correct spatial configuration
-
[56]
Countingevaluates the ability to enumerate instances of a specified category within the image, requiring systematic localization of all relevant regions and aggregation of evidence across the scene. This task is particularly demanding in dense or occluded scenes where individual instances are difficult to discriminate
-
[57]
Comparisonevaluates the ability to reason about differences or similarities between two or more objects, regions, or attributes within the image. Correct answers require localizing relevant region, and performing relational inference over the evidence. A.2 Additional Results Table 5 reports overall text and mask scores per task type. While pipeline models...
-
[58]
Accept minor wording differences, paraphrases, singular/plural variation, and equivalent expressions
-
[59]
Accept answers that are semantically equivalent to the ground-truth answer
-
[60]
Reject answers that are incomplete, overly vague, or refer to the wrong object, attribute, count number, text, or relation
-
[61]
For counting questions, the numeric value must match exactly unless the reference explicitly allows a range
-
[62]
For OCR questions, ignore capitalization and minor punctuation differences, but do not ignore incorrect characters or different words
-
[63]
For hallucination questions, accept only answers that clearly indicate the target does not exist or is not present or the answer cannot be determined from the image
-
[64]
Be strict: the predicted answer should mean the same thing as the ground-truth answer in the context of the question. Return your decision in JSON with the following fields: { "correct": 0 or 1, "reason": "short explanation" } Question: {question} Ground-truth answer: {gt-answer} Predicted answer: {pred-answer} Task type: {task-type} Domain: {domain} A.9 ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.