MiMo-VL technical report

Xiaomi LLM-Core Team: MiMo-VL Technical Report · 2025 · arXiv 2506.03569

Mixed citation behavior. The most common role is background (57% of classified citations).

30 Pith papers cite it.

citation-role summary

background: 4 · baseline: 3

citation-polarity summary

background: 4 · baseline: 3

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

cs.AI · 2026-06-06 · accept · novelty 7.0

MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.

Benchmarking Visual State Tracking in Multimodal Video Understanding

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Introduces the OpenRef benchmark for open-world REC, with F1 and N3R metrics, and a training-free Multi-task Consistency Checker (MCC) that improves existing models in complex scenarios.

Visual Preference Optimization with Rubric Rewards

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

cs.CV · 2026-02-04 · conditional · novelty 7.0

VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Q-Fold is a query-aware spatio-temporal folding technique that constructs heterogeneous focus-context inputs from long videos to improve Video-MLLM performance under fixed visual budgets.

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

Introduces FFR task, F2RVLM and FFRS models, and MLDR dataset for retrieving coherent multi-modal dialogue fragments, reporting superior performance on single-dialogue and corpus benchmarks.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

DetAS-X uses an MLLM agent to adaptively compose detection workflows from restoration modules and expert detectors, enhanced by self-evolving experience harvesting, achieving substantial F1 score gains on challenging benchmarks.

FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis

cs.CV · 2026-05-23 · unverdicted · novelty 6.0

FoodMonitor benchmark evaluates MLLMs on explainable kitchen compliance analysis using dual-channel annotations and a composite C_score metric, with best model at 0.36.

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Vision-OPD transfers an MLLM's privileged regional perception to its full-image policy through on-policy token-level self-distillation, yielding competitive results on fine-grained visual benchmarks.

Video-Zero: Self-Evolution Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

MiMo-Embodied: X-Embodied Foundation Model Technical Report

cs.RO · 2025-11-20 · unverdicted · novelty 6.0

MiMo-Embodied is a single foundation model that achieves state-of-the-art results on 17 embodied AI benchmarks and 12 autonomous driving benchmarks through multi-stage learning, curated data, and CoT/RL fine-tuning that produces positive cross-domain transfer.

PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning

cs.CL · 2025-08-13 · unverdicted · novelty 6.0

PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.

Aloe-Vision: Robust Vision-Language Models for Healthcare

cs.CV · 2026-06-25 · unverdicted · novelty 5.0

Releases open medical LVLMs trained on a quality-filtered multimodal dataset, introduces CareQA-Vision benchmark from exams, reports performance gains over baselines, and flags adversarial vulnerabilities.

citing papers explorer

Showing 1 of 1 citing paper after filters.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

cs.AI · 2026-06-06 · accept · none · ref 7

MLLMs fail to detect absent correct answers in video QA tasks across three evaluation settings, defaulting to distractors even with chain-of-thought prompting.
