MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
hub
SEED-Bench-2-Plus: Bench- 12 marking multimodal large language models with text-rich vi- sual comprehension
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
dataset 4polarities
use dataset 4representative citing papers
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
citing papers explorer
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
Hybrid Latent Reasoning with Decoupled Policy Optimization
HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
-
DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
-
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
-
DeepEyesV2: Toward Agentic Multimodal Model
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.