MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang · 2023 · cs.CL · arXiv 2311.16502

Mixed citation behavior. The most common role is background (53% of classified citations).

47 Pith papers cite it.

abstract

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

citation-role summary

background 9 · dataset 5 · baseline 1

citation-polarity summary

background 8 · use dataset 4 · baseline 2 · unclear 1
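
The two summaries above can be turned into percentage shares with a few lines of arithmetic. The sketch below is illustrative only: the counts are copied from this page, while the helper name `shares` and the choice to round to whole percentages are assumptions. Under those assumptions, the 53% background figure quoted at the top of the page lines up with the polarity counts (8 of 15) rather than the role counts (9 of 15); the page itself does not say which summary the headline figure is derived from.

```python
# Counts copied from the citation-role and citation-polarity summaries on this page.
role_counts = {"background": 9, "dataset": 5, "baseline": 1}
polarity_counts = {"background": 8, "use dataset": 4, "baseline": 2, "unclear": 1}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage.

    Hypothetical helper for illustration; rounding to whole percentages
    is an assumption about how the page displays its figures.
    """
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print("role shares:", role_shares)
print("polarity shares:", polarity_shares)
```

Both summaries cover 15 classified citations, which is fewer than the 47 citing papers listed, so the shares describe only the classified subset.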

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

cs.AI · 2026-01-29 · conditional · novelty 7.0

PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

cs.CV · 2025-11-19 · unverdicted · novelty 7.0

FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard?

cs.CV · 2024-08-20 · unverdicted · novelty 7.0

V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen classes but weaker spatial reasoning.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

cs.CV · 2024-03-21 · conditional · novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

cs.MM · 2026-05-08 · unverdicted · novelty 6.0

MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

cs.AI · 2025-11-26 · unverdicted · novelty 6.0

ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

cs.CV · 2025-08-08 · unverdicted · novelty 6.0

Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

cs.CL · 2025-08-06 · unverdicted · novelty 6.0

MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

citing papers explorer

Showing 47 of 47 citing papers.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders cs.AI · 2026-05-13 · accept · none · ref 66 · internal anchor
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 43 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 44 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning cs.AI · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV · 2026-05-07 · conditional · none · ref 42 · internal anchor
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference cs.DC · 2026-04-21 · unverdicted · none · ref 30 · internal anchor
ReaLB balances multimodal MoE inference loads by switching vision-heavy experts to lower FP4 precision per device rank, hiding the change in the dispatch phase to deliver 1.10-1.32x speedup with <1% accuracy degradation.
PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading cs.AI · 2026-01-29 · conditional · none · ref 7 · internal anchor
PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR cs.CV · 2025-11-19 · unverdicted · none · ref 37 · internal anchor
FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 74 · internal anchor
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 48 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? cs.CV · 2024-08-20 · unverdicted · none · ref 47 · internal anchor
V-RoAst applies zero-shot VLMs (Gemini-1.5-flash, GPT-4o-mini) to iRAP road safety attribute classification on a new ThaiRAP image dataset and compares them to CNN baselines, finding better generalization to unseen classes but weaker spatial reasoning.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 61 · internal anchor
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? cs.AI · 2024-07-01 · accept · none · ref 54 · internal anchor
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 67 · internal anchor
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 57 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? cs.CV · 2024-03-21 · conditional · none · ref 63 · internal anchor
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks cs.MM · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
Co-Evolving Policy Distillation cs.LG · 2026-04-29 · unverdicted · none · ref 25 · internal anchor
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory cs.AI · 2025-11-26 · unverdicted · none · ref 43 · internal anchor
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.
Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 30 · internal anchor
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 171 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models cs.CV · 2025-08-08 · unverdicted · none · ref 19 · internal anchor
Fourier Compressor uses FFT to remove frequency-domain redundancy from visual tokens in VLMs, retaining over 96% accuracy with up to 83.8% FLOP reduction.
Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models cs.CL · 2025-08-06 · unverdicted · none · ref 7 · internal anchor
MedCheck is a lifecycle checklist framework that audits 53 existing medical LLM benchmarks and identifies systemic gaps in clinical fidelity, contamination control, and safety metrics.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 84 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 142 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 290 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 143 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
BLINK: Multimodal Large Language Models Can See but Not Perceive cs.CV · 2024-04-18 · accept · none · ref 87 · internal anchor
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 51 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 128 · internal anchor
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset cs.CV · 2024-02-22 · accept · none · ref 4 · internal anchor
MATH-Vision is a new benchmark of 3,040 visual mathematical competition problems that reveals substantial gaps between large multimodal models and human performance in mathematical reasoning.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 135 · internal anchor
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
GPT-4V(ision) is a Generalist Web Agent, if Grounded cs.IR · 2024-01-03 · conditional · none · ref 26 · internal anchor
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
CogVLM: Visual Expert for Pretrained Language Models cs.CV · 2023-11-06 · conditional · none · ref 28 · internal anchor
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
AlphaEval: Evaluating Agents in Production cs.CL · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models cs.AI · 2026-04-13 · unverdicted · none · ref 16 · internal anchor
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning cs.CV · 2025-11-19 · unverdicted · none · ref 14 · internal anchor
AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
Self-Rewarding Vision-Language Model via Reasoning Decomposition cs.CV · 2025-08-27 · unverdicted · none · ref 26 · internal anchor
Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 40 · internal anchor
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
Qwen2.5-VL Technical Report cs.CV · 2025-02-19 · unverdicted · none · ref 38 · internal anchor
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 167 · internal anchor
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model cs.CV · 2024-01-29 · unverdicted · none · ref 92 · internal anchor
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 32 · internal anchor
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 135 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 130 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
A Brief Overview: On-Policy Self-Distillation In Large Language Models cs.HC · 2026-05-18 · unverdicted · none · ref 55 · 2 links · internal anchor
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs cs.CV · 2026-03-19 · unreviewed · ref 34 · internal anchor
