MiMo-VL technical report

Xiaomi LLM-Core Team: MiMo-VL technical report · 2025 · arXiv 2506.03569

Mixed citation behavior. The most common role is background (57% of classified citations).

30 Pith papers cite it.

citation-role summary

background 4 · baseline 3
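
The 57% background share reported above appears to follow from these counts, assuming the percentage is taken over the 7 classified citations (4 background plus 3 baseline) rather than over all 30 citing papers. A minimal Python sketch of that arithmetic, where `role_counts` and `role_shares` are illustrative names and not part of any hub API:

```python
# Counts copied from the citation-role summary; the denominator
# (classified citations only) is an assumption, not stated by the hub.
role_counts = {"background": 4, "baseline": 3}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)  # {'background': 57, 'baseline': 43}
```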

citation-polarity summary

background 4 · baseline 3

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in egocentric videos averaging 45 minutes that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

cs.AI · 2026-06-06 · accept · novelty 7.0

MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Introduces the OpenRef benchmark for open-world REC, with F1 and N3R metrics, and a training-free Multi-task Consistency Checker (MCC) to improve existing models in complex scenarios.

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

cs.CV · 2026-02-04 · conditional · novelty 7.0

VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Q-Fold is a query-aware spatio-temporal folding technique that constructs heterogeneous focus-context inputs from long videos to improve Video-MLLM performance under fixed visual budgets.

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with the best model scoring 0.36.

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.

Video-Zero: Self-Evolution Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning

cs.CL · 2025-08-13 · unverdicted · novelty 6.0

PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.

Aloe-Vision: Robust Vision-Language Models for Healthcare

cs.CV · 2026-06-25 · unverdicted · novelty 5.0

Releases open medical LVLMs trained on a quality-filtered multimodal dataset, introduces CareQA-Vision benchmark from exams, reports performance gains over baselines, and flags adversarial vulnerabilities.

citing papers explorer

Showing 25 of 25 citing papers after filters.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV · 2026-05-08 · unverdicted · none · ref 41
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV · 2026-04-20 · unverdicted · none · ref 45
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
cs.CV · 2026-07-02 · unverdicted · none · ref 56
LongEgoRefer is a new benchmark of 1,498 referring expressions in egocentric videos averaging 45 minutes that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

Benchmarking Visual State Tracking in Multimodal Video Understanding
cs.CV · 2026-06-02 · unverdicted · none · ref 59
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker
cs.CV · 2026-05-25 · unverdicted · none · ref 37
Introduces the OpenRef benchmark for open-world REC, with F1 and N3R metrics, and a training-free Multi-task Consistency Checker (MCC) to improve existing models in complex scenarios.

Visual Preference Optimization with Rubric Rewards
cs.CV · 2026-04-14 · unverdicted · none · ref 3
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding
cs.CV · 2026-06-10 · unverdicted · none · ref 30
Q-Fold is a query-aware spatio-temporal folding technique that constructs heterogeneous focus-context inputs from long videos to improve Video-MLLM performance under fixed visual budgets.

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
cs.CL · 2026-06-03 · unverdicted · none · ref 136
Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL
cs.AI · 2026-06-01 · unverdicted · none · ref 43
TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning
cs.CV · 2026-05-29 · unverdicted · none · ref 14
DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis
cs.CV · 2026-05-23 · unverdicted · none · ref 33
FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with the best model scoring 0.36.

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
cs.CV · 2026-05-21 · unverdicted · none · ref 29
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
cs.CV · 2026-05-18 · unverdicted · none · ref 50 · 2 links
Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.

Video-Zero: Self-Evolution Video Understanding
cs.CV · 2026-05-14 · unverdicted · none · ref 25
Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
cs.CV · 2026-05-08 · unverdicted · none · ref 37
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
cs.CV · 2026-05-06 · unverdicted · none · ref 35
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
cs.CV · 2026-04-22 · unverdicted · none · ref 69
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
cs.CV · 2026-04-22 · unverdicted · none · ref 71
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

MiMo-Embodied: X-Embodied Foundation Model Technical Report
cs.RO · 2025-11-20 · unverdicted · none · ref 59
MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning
cs.CL · 2025-08-13 · unverdicted · none · ref 11
PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.

Aloe-Vision: Robust Vision-Language Models for Healthcare
cs.CV · 2026-06-25 · unverdicted · none · ref 23
Releases open medical LVLMs trained on a quality-filtered multimodal dataset, introduces CareQA-Vision benchmark from exams, reports performance gains over baselines, and flags adversarial vulnerabilities.

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding
cs.CV · 2026-05-21 · unverdicted · none · ref 31
F2G improves video temporal grounding accuracy by decoupling event identification from boundary measurement using predictive temporal perception to create citable evidence segments for LLM reasoning.

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
cs.CV · 2026-05-12 · unverdicted · none · ref 33 · 4 links
Proposes reasoning-prefix masking during VLM distillation to anchor student thinking on visual evidence and improve multimodal reasoning in smaller models.

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
cs.CV · 2025-09-02 · unverdicted · none · ref 19
Rebalancing designer-painter roles by assigning design to the understanding module via the new DIM dataset yields SOTA image editing performance with a 4.6B model.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR · 2026-05-05 · unverdicted · none · ref 70 · 2 links
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
