SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang · 2025 · arXiv 2504.07934

Baseline reference. 50% of citing Pith papers use this work as a benchmark or comparison.

21 Pith papers citing it


citation-role summary

background 4 · baseline 3 · dataset 1

citation-polarity summary

background 4 · baseline 3 · use dataset 1

representative citing papers

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 6.0

DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

CAVE is a GRPO-based process-reward method that improves VLMs on fragmented visual reasoning by crediting intermediate actions via belief update, evidence acquisition, and adaptive focus, shown on TRACER-Bench and public benchmarks.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

DeepEyesV2: Toward Agentic Multimodal Model

cs.CV · 2025-11-07 · unverdicted · novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

cs.CV · 2025-05-21 · unverdicted · novelty 6.0

Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

cs.CV · 2025-03-21 · conditional · novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

Perceptual Flow Network for Visually Grounded Reasoning

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

cs.LG · 2026-06-03 · unverdicted · novelty 4.0

PivotTrace selects unlabeled data for RLVR by quantifying uncertainty via pivot density from attention dynamics, outperforming full supervision using only 29.3% annotations and converging 2.75 times faster.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11

citing papers explorer

No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs · cs.CV · 2026-06-30 · unverdicted · none · ref 70
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding · cs.CV · 2026-04-24 · unverdicted · none · ref 48
DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning · cs.CV · 2026-06-06 · unverdicted · none · ref 28
AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution · cs.CV · 2026-05-25 · unverdicted · none · ref 43
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning · cs.CV · 2026-05-21 · unverdicted · none · ref 43
CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning · cs.CV · 2026-05-13 · unverdicted · none · ref 42
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs · cs.CV · 2026-05-01 · unverdicted · none · ref 79 · 2 links
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models · cs.CV · 2026-04-22 · unverdicted · none · ref 66
Building a Precise Video Language with Human-AI Oversight · cs.CV · 2026-04-22 · unverdicted · none · ref 65
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents · cs.CV · 2026-04-21 · unverdicted · none · ref 74
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models · cs.CV · 2026-03-15 · unverdicted · none · ref 26
DeepEyesV2: Toward Agentic Multimodal Model · cs.CV · 2025-11-07 · unverdicted · none · ref 51
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs · cs.CV · 2025-05-21 · unverdicted · none · ref 42
Perceptual Flow Network for Visually Grounded Reasoning · cs.CV · 2026-05-04 · unverdicted · none · ref 55
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models · cs.CV · 2026-04-09 · unverdicted · none · ref 38
