pith. machine review for the scientific record. sign in

arxiv: 2605.13803 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords video temporal groundingself-evolving agentsunsupervised learningvideo captioningreinforcement learningproposer-solver loop
0
0 comments X

The pith

Two self-evolving agents learn video temporal grounding from unlabeled videos alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoGround as a pair of coupled agents that start from a shared backbone and learn to localize text-described moments in raw video. A proposer agent invents query-moment pairs while a solver agent practices grounding them; the solver's success signals are used to refine the proposer, forming a closed reinforcement loop that runs without any human labels or external rewards. After training on 2.5K unlabeled videos the resulting system reaches or exceeds the accuracy of fully supervised models on standard VTG benchmarks and simultaneously produces detailed video captions at state-of-the-art level. The central demonstration is that the mutual-improvement loop itself is sufficient to bootstrap both grounding and captioning capabilities.

Core claim

EvoGround shows that a proposer agent generating pseudo query-moment pairs and a solver agent learning to localize them can iteratively improve each other through reinforcement feedback, achieving supervised-level temporal grounding performance and strong fine-grained captioning when trained only on 2.5K raw videos.

What carries the argument

The mutual reinforcement loop in which the proposer generates query-moment pairs from raw video and the solver returns grounding signals that update the proposer.

If this is right

  • The system matches or surpasses fully supervised models on multiple VTG benchmarks after training on only 2.5K unlabeled videos.
  • It produces state-of-the-art fine-grained video captions without any manual labels.
  • Both agents improve across successive iterations of the self-reinforcing loop.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be scaled to much larger unlabeled video collections to further close or exceed the gap to supervised methods.
  • The approach may transfer to other video-language tasks that currently depend on expensive temporal annotations.
  • Because no external reward model is used, the framework's success depends entirely on the internal consistency of the generated pairs and grounding signals.

Load-bearing premise

The mutual reinforcement loop between proposer and solver can bootstrap effective temporal grounding and captioning capabilities starting from raw videos and a shared backbone without any initial human supervision or external reward signals.

What would settle it

Run the full training loop on the same 2.5K videos and measure grounding accuracy on a held-out benchmark; if performance stays at random baseline levels with no measurable improvement across iterations, the bootstrapping claim is false.

Figures

Figures reproduced from arXiv: 2605.13803 by Byoung-Tak Zhang, Lorenzo Torresani, Minjoon Jung.

Figure 1
Figure 1. Figure 1: EvoGround: a self-evolving loop, with unlabeled videos. A proposer and a solver, both initialized from the same base model, co-evolve through reinforcement learning. The proposer generates query (q)–moment (m) pairs from a raw video; the solver grounds them and produces predictions (mˆ ) that feed back as a learning signal. Dedicated reward designs guide each agent. then attempts to ground these queries ba… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of EvoGround. Both agents start from the same backbone. The proposer is updated via three rewards: R prop format (validity), R prop consistency (consistency, computed with SigLIP-2), and R prop feedback (solvability, derived from the solver’s tIoU). The solver is updated via Rsol format and Rsol acc. Stages alternate: the proposer’s pairs train the solver, the solver’s predictions sharpen the prop… view at source ↗
Figure 3
Figure 3. Figure 3: Reward dynamics across iterations. We visualize the evolution of the proposer and solver in (a) and (b), respectively. As the proposer evolves over iterations, the solver correspondingly demonstrates progressively higher accuracy. 3.2 Solver From the query-moment pairs generated by the proposer, each pair is processed independently. We consider a single query–moment pair (q, m). Given the query q, the solv… view at source ↗
Figure 4
Figure 4. Figure 4: Improvements across iterations on TVGBench. (a) shows performance using different learning objectives. (b) and (c) show performance across different video and moment lengths [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Generated data distributions across different reward configurations. Top: Kernel density estimates of normalized start and end times of the moment, shown as solid and dashed lines, respectively. All times are normalized by video duration. Bottom: We report the correlation (r) between query and moment length, along with the mean ˘ standard deviation of moment lengths [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt designs of EvoGround. We show the prompts used for the proposer (top) and solver (bottom). The proposer is instructed to generate consecutive, non-overlapping query–moment pairs from a raw video, while the solver is instructed to localize a given query within the video. resulting variance σn is bounded within a limited range. In the ideal case, where all frames within a moment are perfectly aligned … view at source ↗
Figure 7
Figure 7. Figure 7: Query length distribution across iterations. We visualize the query length distributions across different reward configurations and iterations. As previously discussed in Section 5, the feedback reward increases the length of queries compared to others. A.4 Impact of Training Data and Comparison with Time-R1 A potential concern is that the performance of EvoGround may be influenced by the underlying video … view at source ↗
Figure 8
Figure 8. Figure 8: Per-sample IoU improvements under different thresholds δ. The dashed line represents no change, and the upper-left regions represent improved cases [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Captioning results on TemporalBench [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Generated query–moment pairs from EvoGround. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
read the original abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query--moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EvoGround, a framework consisting of a proposer agent that generates query-moment pairs from raw videos and a solver agent that performs temporal grounding, coupled in a self-reinforcing RL loop initialized from the same backbone. Trained on 2.5K unlabeled videos, it claims to match or surpass fully supervised models on multiple VTG benchmarks while also emerging as a state-of-the-art fine-grained video captioner without any manual labels.

Significance. If the central performance claims hold under rigorous validation, the work would represent a notable advance in unsupervised video temporal grounding by demonstrating that mutual agent improvement can reduce dependence on large annotated datasets. The self-evolving loop is a conceptually interesting direction, though its effectiveness hinges on unverified dynamics that could either bootstrap genuine capabilities or reinforce degenerate alignments.

major comments (2)
  1. Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.
  2. The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.
minor comments (2)
  1. Abstract: Specify the exact benchmarks used for VTG evaluation and the quantitative margins by which EvoGround matches or surpasses supervised baselines.
  2. Abstract: Clarify whether the 2.5K videos are drawn from a single source or multiple datasets, as this affects reproducibility of the unsupervised setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the abstract requires expansion to better substantiate the central claims about the self-reinforcing loop. We address each point below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.

    Authors: We agree the abstract is overly concise and omits key elements of the reward formulation and dynamics. The full manuscript details the reward as a weighted sum of solver matching scores and reconstruction consistency losses (Section 3.2), with training proceeding over 5 iterations on the 2.5K videos. We will revise the abstract to include a brief clause on these signals and the iterative improvement process, allowing readers to assess stability without relying solely on the main text. revision: yes

  2. Referee: The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.

    Authors: The manuscript specifies the feedback via internal matching scores and reconstruction losses, with the RL objective including an entropy regularization term and a proposal diversity penalty (Equation 4 in Section 3.3) to discourage collapse. We acknowledge the abstract does not mention these safeguards. We will revise the abstract to note the presence of regularization that promotes exploration and avoids degenerate equilibria. This addresses the concern while preserving the unsupervised framing. revision: partial

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

full rationale

The paper presents a self-reinforcing RL loop between proposer and solver agents initialized from a shared backbone and trained on 2.5K unlabeled videos. The abstract and provided text describe mutual improvement via internal feedback signals without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to inputs by construction. No load-bearing step is shown to be equivalent to its own data or prior results via the enumerated patterns; the emergence of grounding and captioning capabilities is asserted as a consequence of the loop dynamics rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed from abstract only; specific free parameters, axioms, and invented entities cannot be enumerated without the methods section.

axioms (1)
  • domain assumption The self-reinforcing RL loop produces net improvement in grounding accuracy without external labels or supervision
    Central mechanism asserted in the abstract but not justified or detailed here

pith-pipeline@v0.9.0 · 5449 in / 1142 out tokens · 36027 ms · 2026-05-14T19:24:40.631477+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 39 canonical work pages · 16 internal anchors

  1. [1]

    Modal-specific pseudo query generation for video corpus moment retrieval,

    M. Jung, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Modal-specific pseudo query generation for video corpus moment retrieval,” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7769–7781, Dec. 2022

  2. [2]

    Detecting moments and highlights in videos via natural language queries,

    J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,”Advances in Neural Information Processing Systems, vol. 34, pp. 11846– 11858, 2021

  3. [3]

    Can i trust your answer? visually grounded video question answering,

    J. Xiao, A. Yao, Y . Li, and T.-S. Chua, “Can i trust your answer? visually grounded video question answering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13204–13214, 2024

  4. [4]

    Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,

    Z. Shangguan, C. Li, Y . Ding, Y . Zheng, Y . Zhao, T. Fitzgerald, and A. Cohan, “Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,”arXiv preprint arXiv:2410.23266, 2024

  5. [5]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,”arXiv preprint arXiv:2405.21075, 2024

  6. [6]

    Egoexo-con: Exploring view-invariant video temporal understanding,

    M. Jung, J. Xiao, J. Kim, B.-T. Zhang, and A. Yao, “Egoexo-con: Exploring view-invariant video temporal understanding,”arXiv preprint arXiv:2510.26113, 2025

  7. [7]

    Tall: Temporal activity localization via language query,

    J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity localization via language query,” inProceedings of the IEEE international conference on computer vision, pp. 5267–5275, 2017

  8. [8]

    Dense regression network for video grounding,

    R. Zeng, H. Xu, W. Huang, P. Chen, M. Tan, and C. Gan, “Dense regression network for video grounding,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10287–10296, 2020

  9. [9]

    Exploiting auxiliary caption for video grounding,

    H. Li, M. Cao, X. Cheng, Y . Li, Z. Zhu, and Y . Zou, “Exploiting auxiliary caption for video grounding,” inProceedings of the AAAI conference on artificial intelligence, 2024

  10. [10]

    A closer look at temporal sentence grounding in videos: Dataset and metric,

    Y . Yuan, X. Lan, X. Wang, L. Chen, Z. Wang, and W. Zhu, “A closer look at temporal sentence grounding in videos: Dataset and metric,” inProceedings of the 2nd international workshop on human-centric multimedia analysis, pp. 13–21, 2021

  11. [11]

    Uncovering hidden challenges in query- based video moment retrieval,

    M. Otani, Y . Nakashima, E. Rahtu, and J. Heikkilä, “Uncovering hidden challenges in query- based video moment retrieval,”arXiv preprint arXiv:2009.00325, 2020

  12. [12]

    On the consistency of video large language models in temporal comprehension,

    M. Jung, J. Xiao, B.-T. Zhang, and A. Yao, “On the consistency of video large language models in temporal comprehension,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 13713–13722, 2025

  13. [13]

    Dense-captioning events in videos,

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-captioning events in videos,” inInternational Conference on Computer Vision (ICCV), 2017

  14. [14]

    Time-r1: Post-training large vision language model for temporal video grounding,

    Y . Wang, Z. Wang, B. Xu, Y . Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yang,et al., “Time-r1: Post-training large vision language model for temporal video grounding,”arXiv preprint arXiv:2503.13377, 2025

  15. [15]

    Rextime: A benchmark suite for reasoning-across-time in videos,

    J.-J. Chen, Y .-C. Liao, H.-C. Lin, Y .-C. Yu, Y .-C. Chen, and F. Wang, “Rextime: A benchmark suite for reasoning-across-time in videos,”Advances in Neural Information Processing Systems, vol. 37, pp. 28662–28673, 2024

  16. [16]

    Et bench: Towards open-ended event- level video-language understanding,

    Y . Liu, Z. Ma, Z. Qi, Y . Wu, Y . Shan, and C. W. Chen, “Et bench: Towards open-ended event- level video-language understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 32076–32110, 2024

  17. [17]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,

    M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y . Zhong, Y . Shang,et al., “Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,”arXiv preprint arXiv:2410.10818, 2024. 10

  18. [18]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervi- sion,” inInternational conference on machine learning, pp. 8748–8763, PmLR, 2021

  19. [19]

    Learning 2d temporal adjacent networks for moment local- ization with natural language,

    S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent networks for moment local- ization with natural language,” inProceedings of the AAAI Conference on Artificial Intelligence, pp. 12870–12877, 2020

  20. [20]

    Span-based localizing network for natural language video localization,

    H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizing network for natural language video localization,”arXiv preprint arXiv:2004.13931, 2020

  21. [21]

    Local-global video-text interactions for temporal grounding,

    J. Mun, M. Cho, and B. Han, “Local-global video-text interactions for temporal grounding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819, 2020

  22. [22]

    Background-aware moment detection for video moment retrieval,

    M. Jung, Y . Jang, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Background-aware moment detection for video moment retrieval,” inProceedings of the Winter Conference on Applications of Computer Vision, pp. 8575–8585, February 2025

  23. [23]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez,et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023), 2023

  24. [24]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar,et al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  25. [25]

    Video-chatgpt: Towards detailed video understanding via large vision and language models,

    M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024

  26. [26]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y . He, Y . Wang, Y . Li, W. Wang, P. Luo, Y . Wang, L. Wang, and Y . Qiao, “Videochat: Chat-centric video understanding,”arXiv preprint arXiv:2305.06355, 2023

  27. [27]

    Moviechat: From dense token to sparse memory for long video understanding,

    E. Song, W. Chai, G. Wang, Y . Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y . Zhang,et al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18221–18232, 2024

  28. [28]

    Vtimellm: Empower llm to grasp video moments,

    B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14271–14280, 2024

  29. [29]

    Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,

    Y . Guo, J. Liu, M. Li, X. Tang, X. Chen, and B. Zhao, “Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,”arXiv preprint arXiv:2405.13382, 2024

  30. [30]

    Timechat: A time-sensitive multimodal large language model for long video understanding,

    S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323, 2024

  31. [31]

    Timesuite: Improving mllms for long video understanding via grounded tuning,

    X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y . Shi, Z. Yue, Y . Wang,et al., “Timesuite: Improving mllms for long video understanding via grounded tuning,”arXiv preprint arXiv:2410.19702, 2024

  32. [32]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi,et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  33. [33]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    K. Feng, K. Gong, B. Li, Z. Guo, Y . Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,”arXiv preprint arXiv:2503.21776, 2025

  34. [34]

    Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,

    X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y . He, Y . Wang, Y . Qiao, Y . Wang, and L. Wang, “Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,”arXiv preprint arXiv:2504.06958, 2025. 11

  35. [35]

    Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,

    Z. Yan, X. Li, Y . He, Z. Yue, X. Zeng, Y . Wang, Y . Qiao, L. Wang, and Y . Wang, “Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,”arXiv preprint arXiv:2509.21100, 2025

  36. [36]

    Qwen2. 5 technical report,

    A. Y . Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei,et al., “Qwen2. 5 technical report,”arXiv preprint, 2024

  37. [37]

    Dr. zero: Self- evolving search agents without training data,

    Z. Yue, K. Upasani, X. Yang, S. Ge, S. Nie, Y . Mao, Z. Liu, and D. Wang, “Dr. zero: Self- evolving search agents without training data,”arXiv preprint arXiv:2601.07055, 2026

  38. [38]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu, “R-zero: Self-evolving reasoning llm from zero data,”arXiv preprint arXiv:2508.05004, 2025

  39. [39]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    A. Zhao, Y . Wu, Y . Yue, T. Wu, Q. Xu, M. Lin, S. Wang, Q. Wu, Z. Zheng, and G. Huang, “Absolute zero: Reinforced self-play reasoning with zero data,”arXiv preprint arXiv:2505.03335, 2025

  40. [40]

    Mm-zero: Self-evolving multi-model vision language models from zero data,

    Z. Li, H. Du, C. Huang, X. Wu, L. Yu, Y . He, J. Xie, X. Wu, Z. Liu, J. Zhang,et al., “Mm- zero: Self-evolving multi-model vision language models from zero data,”arXiv preprint arXiv:2603.09206, 2026

  41. [41]

    Visplay: Self-evolving vision-language models from images,

    Y . He, C. Huang, Z. Li, J. Huang, and Y . Yang, “Visplay: Self-evolving vision-language models from images,”arXiv preprint arXiv:2511.15661, 2025

  42. [42]

    Evolmm: Self-evolving large multimodal models with continuous rewards,

    O. Thawakar, S. Venkatraman, R. Thawkar, A. Shaker, H. Cholakkal, R. M. Anwer, S. Khan, and F. Khan, “Evolmm: Self-evolving large multimodal models with continuous rewards,”arXiv preprint arXiv:2511.16672, 2025

  43. [43]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang,et al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  44. [44]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa,et al., “Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

  45. [45]

    GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

    S.-Y . Liu, X. Dong, X. Lu, S. Diao, P. Belcak, M. Liu, M.-H. Chen, H. Yin, Y .-C. F. Wang, K.-T. Cheng,et al., “Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization,”arXiv preprint arXiv:2601.05242, 2026

  46. [46]

    Hawkeye: Training video-text llms for grounding text in videos,

    Y . Wang, X. Meng, J. Liang, Y . Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,”arXiv preprint arXiv:2403.10228, 2024

  47. [47]

    Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,

    L. Qian, J. Li, Y . Wu, Y . Ye, H. Fei, T.-S. Chua, Y . Zhuang, and S. Tang, “Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,”arXiv preprint arXiv:2402.11435, 2024

  48. [48]

    Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,

    H. Wang, Z. Xu, Y . Cheng, S. Diao, Y . Zhou, Y . Cao, Q. Wang, W. Ge, and L. Huang, “Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,”arXiv preprint arXiv:2410.03290, 2024

  49. [49]

    Trace: Temporal grounding video llm via causal event modeling,

    Y . Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,”arXiv preprint arXiv:2410.05643, 2024

  50. [50]

    Enrich and detect: Video temporal grounding with multimodal llms,

    S. Pramanick, E. Mavroudi, Y . Song, R. Chellappa, L. Torresani, and T. Afouras, “Enrich and detect: Video temporal grounding with multimodal llms,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24297–24308, 2025

  51. [51]

    Videochat-flash: Hierarchical compression for long-context video modeling,

    X. Li, Y . Wang, J. Yu, X. Zeng, Y . Zhu, H. Huang, J. Gao, K. Li, Y . He, C. Wang,et al., “Videochat-flash: Hierarchical compression for long-context video modeling,”arXiv preprint arXiv:2501.00574, 2024

  52. [52]

    Lita: Language instructed temporal-localization assistant,

    D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P. Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal-localization assistant,”arXiv preprint arXiv:2403.19046, 2024. 12

  53. [53]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu,et al., “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

  54. [54]

    Llava-next: Improved reasoning, ocr, and world knowledge,

    H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024

  55. [55]

    Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

    P. Zhang, X. Dong, Y . Zang, Y . Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang, et al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long- contextual input and output,”arXiv preprint arXiv:2407.03320, 2024

  56. [56]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023

  57. [57]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He,et al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024

  58. [58]

    Phi-4 Technical Report

    M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann,et al., “Phi-4 technical report,”arXiv preprint arXiv:2412.08905, 2024

  59. [59]

    Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,

    B. He, H. Li, Y . K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13504– 13514, 2024

  60. [60]

    Matryoshka multimodal models,

    M. Cai, J. Yang, J. Gao, and Y . J. Lee, “Matryoshka multimodal models,”arXiv preprint arXiv:2405.17430, 2024

  61. [61]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015

  62. [62]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002

  63. [63]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, pp. 74–81, 2004

  64. [64]

    Sentence-bert: Sentence embeddings using siamese bert- networks,

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert- networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019

  65. [65]

    Qwen3-VL Technical Report

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge,et al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

  66. [66]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao,et al., “Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

  67. [67]

    Tempcompass: Do video llms really understand videos?,

    Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,”arXiv preprint arXiv:2403.00476, 2024

  68. [68]

    Defining and characterizing reward gaming,

    J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward gaming,”Advances in Neural Information Processing Systems, vol. 35, pp. 9460–9471, 2022

  69. [69]

    Chain-of- thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou,et al., “Chain-of- thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022. 13 Appendix In this section, we provide details that are not included in the main manuscript due to the page limit. ...