DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
hub Mixed citations
MiMo-VL technical report
Mixed citation behavior. Most common role is background (57%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
Q-Fold is a query-aware spatio-temporal folding technique that constructs heterogeneous focus-context inputs from long videos to improve Video-MLLM performance under fixed visual budgets.
Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.
TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with best model at 0.36.
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.
Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.
Releases open medical LVLMs trained on a quality-filtered multimodal dataset, introduces CareQA-Vision benchmark from exams, reports performance gains over baselines, and flags adversarial vulnerabilities.
citing papers explorer
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.
-
Benchmarking Visual State Tracking in Multimodal Video Understanding
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
-
Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker
OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.
-
Visual Preference Optimization with Rubric Rewards
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
-
Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding
Q-Fold is a query-aware spatio-temporal folding technique that constructs heterogeneous focus-context inputs from long videos to improve Video-MLLM performance under fixed visual budgets.
-
Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.
-
TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL
TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.
-
Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.
-
FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis
FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with best model at 0.36.
-
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
-
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.
-
Video-Zero: Self-Evolution Video Understanding
Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.
-
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
-
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
-
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
-
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
-
MiMo-Embodied: X-Embodied Foundation Model Technical Report
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.
-
PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning
PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.
-
Aloe-Vision: Robust Vision-Language Models for Healthcare
Releases open medical LVLMs trained on a quality-filtered multimodal dataset, introduces CareQA-Vision benchmark from exams, reports performance gains over baselines, and flags adversarial vulnerabilities.
-
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding
F2G improves video temporal grounding accuracy by decoupling event identification from boundary measurement using predictive temporal perception to create citable evidence segments for LLM reasoning.
-
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Proposes reasoning-prefix masking during VLM distillation to anchor student thinking on visual evidence and improve multimodal reasoning in smaller models.
-
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.