Video-OPD uses on-policy distillation from a frontier teacher to turn sparse episode rewards into dense step-wise signals for more efficient post-training of MLLMs on temporal video grounding.
Tempo-r0: A video-mllm for temporal video grounding through efficient temporal sensing reinforcement learning.arXiv preprint arXiv:2507.04702, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.
citing papers explorer
-
Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
Video-OPD uses on-policy distillation from a frontier teacher to turn sparse episode rewards into dense step-wise signals for more efficient post-training of MLLMs on temporal video grounding.
-
MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues
MLLMs know event timing during prefill via sparse Temporal Grounding Heads but lose it in autoregressive decoding; restricting visual context to the high-attention interval at inference time improves VTG performance on three benchmarks.