hub Mixed citations

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou · 2025 · arXiv 2505.12434

Mixed citation behavior. Most common role is background (60%).

14 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1

citation-polarity summary

background 3 baseline 1 unclear 1

representative citing papers

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.

Act2See: Emergent Active Visual Perception for Video Reasoning

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

VIDEOP2R: Video Understanding from Perception to Reasoning

cs.CV · 2025-11-14 · conditional · novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

cs.CV · 2025-09-19 · unverdicted · novelty 7.0

Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

AdaTooler-V: Adaptive Tool-Use for Images and Videos

cs.CV · 2025-12-18 · conditional · novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

cs.CV · 2025-05-27 · unverdicted · novelty 6.0

MUSEG applies timestamp-aware multi-segment grounding with a phased-reward RL recipe to boost temporal grounding and time-sensitive video QA performance in MLLMs.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

cs.CV · 2025-12-03 · unverdicted · novelty 5.0

TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

cs.CV · 2026-04-29 · unverdicted · novelty 4.0

CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

citing papers explorer

Showing 14 of 14 citing papers.

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 27
STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
Act2See: Emergent Active Visual Perception for Video Reasoning cs.CV · 2026-05-03 · unverdicted · none · ref 29
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench cs.AI · 2026-04-17 · conditional · none · ref 8
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
VIDEOP2R: Video Understanding from Perception to Reasoning cs.CV · 2025-11-14 · conditional · none · ref 52
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies? cs.CV · 2025-09-19 · unverdicted · none · ref 10
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 33 · 2 links
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 34
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 37
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
AdaTooler-V: Adaptive Tool-Use for Images and Videos cs.CV · 2025-12-18 · conditional · none · ref 67
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling cs.CV · 2025-11-25 · unverdicted · none · ref 46
LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding cs.CV · 2025-05-27 · unverdicted · none · ref 32
MUSEG applies timestamp-aware multi-segment grounding with a phased-reward RL recipe to boost temporal grounding and time-sensitive video QA performance in MLLMs.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 40 · 4 links
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning cs.CV · 2025-12-03 · unverdicted · none · ref 55
TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding cs.CV · 2026-04-29 · unverdicted · none · ref 77
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer