arxiv: 2605.13803 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung , Byoung-Tak Zhang , Lorenzo Torresani

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords video temporal groundingself-evolving agentsunsupervised learningvideo captioningreinforcement learningproposer-solver loop

0 comments

The pith

Two self-evolving agents learn video temporal grounding from unlabeled videos alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoGround as a pair of coupled agents that start from a shared backbone and learn to localize text-described moments in raw video. A proposer agent invents query-moment pairs while a solver agent practices grounding them; the solver's success signals are used to refine the proposer, forming a closed reinforcement loop that runs without any human labels or external rewards. After training on 2.5K unlabeled videos the resulting system reaches or exceeds the accuracy of fully supervised models on standard VTG benchmarks and simultaneously produces detailed video captions at state-of-the-art level. The central demonstration is that the mutual-improvement loop itself is sufficient to bootstrap both grounding and captioning capabilities.

Core claim

EvoGround shows that a proposer agent generating pseudo query-moment pairs and a solver agent learning to localize them can iteratively improve each other through reinforcement feedback, achieving supervised-level temporal grounding performance and strong fine-grained captioning when trained only on 2.5K raw videos.

What carries the argument

The mutual reinforcement loop in which the proposer generates query-moment pairs from raw video and the solver returns grounding signals that update the proposer.

If this is right

The system matches or surpasses fully supervised models on multiple VTG benchmarks after training on only 2.5K unlabeled videos.
It produces state-of-the-art fine-grained video captions without any manual labels.
Both agents improve across successive iterations of the self-reinforcing loop.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loop could be scaled to much larger unlabeled video collections to further close or exceed the gap to supervised methods.
The approach may transfer to other video-language tasks that currently depend on expensive temporal annotations.
Because no external reward model is used, the framework's success depends entirely on the internal consistency of the generated pairs and grounding signals.

Load-bearing premise

The mutual reinforcement loop between proposer and solver can bootstrap effective temporal grounding and captioning capabilities starting from raw videos and a shared backbone without any initial human supervision or external reward signals.

What would settle it

Run the full training loop on the same 2.5K videos and measure grounding accuracy on a held-out benchmark; if performance stays at random baseline levels with no measurable improvement across iterations, the bootstrapping claim is false.

Figures

Figures reproduced from arXiv: 2605.13803 by Byoung-Tak Zhang, Lorenzo Torresani, Minjoon Jung.

**Figure 1.** Figure 1: EvoGround: a self-evolving loop, with unlabeled videos. A proposer and a solver, both initialized from the same base model, co-evolve through reinforcement learning. The proposer generates query (q)–moment (m) pairs from a raw video; the solver grounds them and produces predictions (mˆ ) that feed back as a learning signal. Dedicated reward designs guide each agent. then attempts to ground these queries ba… view at source ↗

**Figure 2.** Figure 2: Overview of EvoGround. Both agents start from the same backbone. The proposer is updated via three rewards: R prop format (validity), R prop consistency (consistency, computed with SigLIP-2), and R prop feedback (solvability, derived from the solver’s tIoU). The solver is updated via Rsol format and Rsol acc. Stages alternate: the proposer’s pairs train the solver, the solver’s predictions sharpen the prop… view at source ↗

**Figure 3.** Figure 3: Reward dynamics across iterations. We visualize the evolution of the proposer and solver in (a) and (b), respectively. As the proposer evolves over iterations, the solver correspondingly demonstrates progressively higher accuracy. 3.2 Solver From the query-moment pairs generated by the proposer, each pair is processed independently. We consider a single query–moment pair (q, m). Given the query q, the solv… view at source ↗

**Figure 4.** Figure 4: Improvements across iterations on TVGBench. (a) shows performance using different learning objectives. (b) and (c) show performance across different video and moment lengths [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Generated data distributions across different reward configurations. Top: Kernel density estimates of normalized start and end times of the moment, shown as solid and dashed lines, respectively. All times are normalized by video duration. Bottom: We report the correlation (r) between query and moment length, along with the mean ˘ standard deviation of moment lengths [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt designs of EvoGround. We show the prompts used for the proposer (top) and solver (bottom). The proposer is instructed to generate consecutive, non-overlapping query–moment pairs from a raw video, while the solver is instructed to localize a given query within the video. resulting variance σn is bounded within a limited range. In the ideal case, where all frames within a moment are perfectly aligned … view at source ↗

**Figure 7.** Figure 7: Query length distribution across iterations. We visualize the query length distributions across different reward configurations and iterations. As previously discussed in Section 5, the feedback reward increases the length of queries compared to others. A.4 Impact of Training Data and Comparison with Time-R1 A potential concern is that the performance of EvoGround may be influenced by the underlying video … view at source ↗

**Figure 8.** Figure 8: Per-sample IoU improvements under different thresholds δ. The dashed line represents no change, and the upper-left regions represent improved cases [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Captioning results on TemporalBench [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Generated query–moment pairs from EvoGround. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

read the original abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query--moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoGround's proposer-solver RL loop is a distinct try at fully unsupervised VTG but the abstract leaves the reward signals and loop stability completely opaque, so the matching-supervised claim is hard to trust yet.

read the letter

The main takeaway is that this paper sets up two agents that generate their own query-moment pairs from raw video and then learn to ground them, feeding improvements back and forth in a closed loop. Trained on 2.5K unlabeled videos, it claims to match or beat fully supervised VTG models while also producing strong fine-grained captions. That closed-loop mutual improvement is the piece that feels new compared with standard pseudo-labeling or self-training setups that usually keep some external signal or initial supervision in the mix. The framing around label cost in video grounding is straightforward and the side result on captioning is noted without overclaiming. The soft spots sit right in the middle of the central mechanism. The abstract gives no concrete description of how the solver's feedback becomes a reward for the proposer, what prevents early low-quality proposals from locking both agents into consistent but wrong alignments, or any iteration plots showing actual improvement rather than convergence. Without those pieces the performance numbers rest on an untested assumption that the internal consistency signals are self-correcting. The stress-test concern about drifting into a stable but inaccurate equilibrium looks like a live risk here because nothing in the summary shows an external anchor or escape mechanism. This is for people working on self-supervised video models who want to see whether a pure agent loop can replace labeled data. A reader already thinking about RL for grounding or captioning would get value from the setup even if they end up skeptical of the results. I would send it for peer review so the methods and dynamics can be checked in detail; the idea is specific enough to be worth referee time even if heavy revision is needed on the evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EvoGround, a framework consisting of a proposer agent that generates query-moment pairs from raw videos and a solver agent that performs temporal grounding, coupled in a self-reinforcing RL loop initialized from the same backbone. Trained on 2.5K unlabeled videos, it claims to match or surpass fully supervised models on multiple VTG benchmarks while also emerging as a state-of-the-art fine-grained video captioner without any manual labels.

Significance. If the central performance claims hold under rigorous validation, the work would represent a notable advance in unsupervised video temporal grounding by demonstrating that mutual agent improvement can reduce dependence on large annotated datasets. The self-evolving loop is a conceptually interesting direction, though its effectiveness hinges on unverified dynamics that could either bootstrap genuine capabilities or reinforce degenerate alignments.

major comments (2)

Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.
The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.

minor comments (2)

Abstract: Specify the exact benchmarks used for VTG evaluation and the quantitative margins by which EvoGround matches or surpasses supervised baselines.
Abstract: Clarify whether the 2.5K videos are drawn from a single source or multiple datasets, as this affects reproducibility of the unsupervised setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that the abstract requires expansion to better substantiate the central claims about the self-reinforcing loop. We address each point below and will incorporate revisions accordingly.

read point-by-point responses

Referee: Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.

Authors: We agree the abstract is overly concise and omits key elements of the reward formulation and dynamics. The full manuscript details the reward as a weighted sum of solver matching scores and reconstruction consistency losses (Section 3.2), with training proceeding over 5 iterations on the 2.5K videos. We will revise the abstract to include a brief clause on these signals and the iterative improvement process, allowing readers to assess stability without relying solely on the main text. revision: yes
Referee: The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.

Authors: The manuscript specifies the feedback via internal matching scores and reconstruction losses, with the RL objective including an entropy regularization term and a proposal diversity penalty (Equation 4 in Section 3.3) to discourage collapse. We acknowledge the abstract does not mention these safeguards. We will revise the abstract to note the presence of regularization that promotes exploration and avoids degenerate equilibria. This addresses the concern while preserving the unsupervised framing. revision: partial

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

full rationale

The paper presents a self-reinforcing RL loop between proposer and solver agents initialized from a shared backbone and trained on 2.5K unlabeled videos. The abstract and provided text describe mutual improvement via internal feedback signals without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to inputs by construction. No load-bearing step is shown to be equivalent to its own data or prior results via the enumerated patterns; the emergence of grounding and captioning capabilities is asserted as a consequence of the loop dynamics rather than a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed from abstract only; specific free parameters, axioms, and invented entities cannot be enumerated without the methods section.

axioms (1)

domain assumption The self-reinforcing RL loop produces net improvement in grounding accuracy without external labels or supervision
Central mechanism asserted in the abstract but not justified or detailed here

pith-pipeline@v0.9.0 · 5449 in / 1142 out tokens · 36027 ms · 2026-05-14T19:24:40.631477+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 39 canonical work pages · 16 internal anchors

[1]

Modal-specific pseudo query generation for video corpus moment retrieval,

M. Jung, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Modal-specific pseudo query generation for video corpus moment retrieval,” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7769–7781, Dec. 2022

2022
[2]

Detecting moments and highlights in videos via natural language queries,

J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,”Advances in Neural Information Processing Systems, vol. 34, pp. 11846– 11858, 2021

2021
[3]

Can i trust your answer? visually grounded video question answering,

J. Xiao, A. Yao, Y . Li, and T.-S. Chua, “Can i trust your answer? visually grounded video question answering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13204–13214, 2024

2024
[4]

Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,

Z. Shangguan, C. Li, Y . Ding, Y . Zheng, Y . Zhao, T. Fitzgerald, and A. Cohan, “Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,”arXiv preprint arXiv:2410.23266, 2024

work page arXiv 2024
[5]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,”arXiv preprint arXiv:2405.21075, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Egoexo-con: Exploring view-invariant video temporal understanding,

M. Jung, J. Xiao, J. Kim, B.-T. Zhang, and A. Yao, “Egoexo-con: Exploring view-invariant video temporal understanding,”arXiv preprint arXiv:2510.26113, 2025

work page arXiv 2025
[7]

Tall: Temporal activity localization via language query,

J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity localization via language query,” inProceedings of the IEEE international conference on computer vision, pp. 5267–5275, 2017

2017
[8]

Dense regression network for video grounding,

R. Zeng, H. Xu, W. Huang, P. Chen, M. Tan, and C. Gan, “Dense regression network for video grounding,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10287–10296, 2020

2020
[9]

Exploiting auxiliary caption for video grounding,

H. Li, M. Cao, X. Cheng, Y . Li, Z. Zhu, and Y . Zou, “Exploiting auxiliary caption for video grounding,” inProceedings of the AAAI conference on artificial intelligence, 2024

2024
[10]

A closer look at temporal sentence grounding in videos: Dataset and metric,

Y . Yuan, X. Lan, X. Wang, L. Chen, Z. Wang, and W. Zhu, “A closer look at temporal sentence grounding in videos: Dataset and metric,” inProceedings of the 2nd international workshop on human-centric multimedia analysis, pp. 13–21, 2021

2021
[11]

Uncovering hidden challenges in query- based video moment retrieval,

M. Otani, Y . Nakashima, E. Rahtu, and J. Heikkilä, “Uncovering hidden challenges in query- based video moment retrieval,”arXiv preprint arXiv:2009.00325, 2020

work page arXiv 2009
[12]

On the consistency of video large language models in temporal comprehension,

M. Jung, J. Xiao, B.-T. Zhang, and A. Yao, “On the consistency of video large language models in temporal comprehension,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 13713–13722, 2025

2025
[13]

Dense-captioning events in videos,

R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-captioning events in videos,” inInternational Conference on Computer Vision (ICCV), 2017

2017
[14]

Time-r1: Post-training large vision language model for temporal video grounding,

Y . Wang, Z. Wang, B. Xu, Y . Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yang,et al., “Time-r1: Post-training large vision language model for temporal video grounding,”arXiv preprint arXiv:2503.13377, 2025

work page arXiv 2025
[15]

Rextime: A benchmark suite for reasoning-across-time in videos,

J.-J. Chen, Y .-C. Liao, H.-C. Lin, Y .-C. Yu, Y .-C. Chen, and F. Wang, “Rextime: A benchmark suite for reasoning-across-time in videos,”Advances in Neural Information Processing Systems, vol. 37, pp. 28662–28673, 2024

2024
[16]

Et bench: Towards open-ended event- level video-language understanding,

Y . Liu, Z. Ma, Z. Qi, Y . Wu, Y . Shan, and C. W. Chen, “Et bench: Towards open-ended event- level video-language understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 32076–32110, 2024

2024
[17]

Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,

M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y . Zhong, Y . Shang,et al., “Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,”arXiv preprint arXiv:2410.10818, 2024. 10

work page arXiv 2024
[18]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervi- sion,” inInternational conference on machine learning, pp. 8748–8763, PmLR, 2021

2021
[19]

Learning 2d temporal adjacent networks for moment local- ization with natural language,

S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent networks for moment local- ization with natural language,” inProceedings of the AAAI Conference on Artificial Intelligence, pp. 12870–12877, 2020

2020
[20]

Span-based localizing network for natural language video localization,

H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizing network for natural language video localization,”arXiv preprint arXiv:2004.13931, 2020

work page arXiv 2004
[21]

Local-global video-text interactions for temporal grounding,

J. Mun, M. Cho, and B. Han, “Local-global video-text interactions for temporal grounding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819, 2020

2020
[22]

Background-aware moment detection for video moment retrieval,

M. Jung, Y . Jang, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Background-aware moment detection for video moment retrieval,” inProceedings of the Winter Conference on Applications of Computer Vision, pp. 8575–8585, February 2025

2025
[23]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez,et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023), 2023

2023
[24]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar,et al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Video-chatgpt: Towards detailed video understanding via large vision and language models,

M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024

2024
[26]

VideoChat: Chat-Centric Video Understanding

K. Li, Y . He, Y . Wang, Y . Li, W. Wang, P. Luo, Y . Wang, L. Wang, and Y . Qiao, “Videochat: Chat-centric video understanding,”arXiv preprint arXiv:2305.06355, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Moviechat: From dense token to sparse memory for long video understanding,

E. Song, W. Chai, G. Wang, Y . Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y . Zhang,et al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18221–18232, 2024

2024
[28]

Vtimellm: Empower llm to grasp video moments,

B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14271–14280, 2024

2024
[29]

Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,

Y . Guo, J. Liu, M. Li, X. Tang, X. Chen, and B. Zhao, “Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,”arXiv preprint arXiv:2405.13382, 2024

work page arXiv 2024
[30]

Timechat: A time-sensitive multimodal large language model for long video understanding,

S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323, 2024

2024
[31]

Timesuite: Improving mllms for long video understanding via grounded tuning,

X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y . Shi, Z. Yue, Y . Wang,et al., “Timesuite: Improving mllms for long video understanding via grounded tuning,”arXiv preprint arXiv:2410.19702, 2024

work page arXiv 2024
[32]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi,et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Video-R1: Reinforcing Video Reasoning in MLLMs

K. Feng, K. Gong, B. Li, Z. Guo, Y . Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,”arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,

X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y . He, Y . Wang, Y . Qiao, Y . Wang, and L. Wang, “Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,”arXiv preprint arXiv:2504.06958, 2025. 11

work page arXiv 2025
[35]

Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,

Z. Yan, X. Li, Y . He, Z. Yue, X. Zeng, Y . Wang, Y . Qiao, L. Wang, and Y . Wang, “Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,”arXiv preprint arXiv:2509.21100, 2025

work page arXiv 2025
[36]

Qwen2. 5 technical report,

A. Y . Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei,et al., “Qwen2. 5 technical report,”arXiv preprint, 2024

2024
[37]

Dr. zero: Self- evolving search agents without training data,

Z. Yue, K. Upasani, X. Yang, S. Ge, S. Nie, Y . Mao, Z. Liu, and D. Wang, “Dr. zero: Self- evolving search agents without training data,”arXiv preprint arXiv:2601.07055, 2026

work page arXiv 2026
[38]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu, “R-zero: Self-evolving reasoning llm from zero data,”arXiv preprint arXiv:2508.05004, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

A. Zhao, Y . Wu, Y . Yue, T. Wu, Q. Xu, M. Lin, S. Wang, Q. Wu, Z. Zheng, and G. Huang, “Absolute zero: Reinforced self-play reasoning with zero data,”arXiv preprint arXiv:2505.03335, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Mm-zero: Self-evolving multi-model vision language models from zero data,

Z. Li, H. Du, C. Huang, X. Wu, L. Yu, Y . He, J. Xie, X. Wu, Z. Liu, J. Zhang,et al., “Mm- zero: Self-evolving multi-model vision language models from zero data,”arXiv preprint arXiv:2603.09206, 2026

work page arXiv 2026
[41]

Visplay: Self-evolving vision-language models from images,

Y . He, C. Huang, Z. Li, J. Huang, and Y . Yang, “Visplay: Self-evolving vision-language models from images,”arXiv preprint arXiv:2511.15661, 2025

work page arXiv 2025
[42]

Evolmm: Self-evolving large multimodal models with continuous rewards,

O. Thawakar, S. Venkatraman, R. Thawkar, A. Shaker, H. Cholakkal, R. M. Anwer, S. Khan, and F. Khan, “Evolmm: Self-evolving large multimodal models with continuous rewards,”arXiv preprint arXiv:2511.16672, 2025

work page arXiv 2025
[43]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang,et al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa,et al., “Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

S.-Y . Liu, X. Dong, X. Lu, S. Diao, P. Belcak, M. Liu, M.-H. Chen, H. Yin, Y .-C. F. Wang, K.-T. Cheng,et al., “Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization,”arXiv preprint arXiv:2601.05242, 2026

work page internal anchor Pith review arXiv 2026
[46]

Hawkeye: Training video-text llms for grounding text in videos,

Y . Wang, X. Meng, J. Liang, Y . Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,”arXiv preprint arXiv:2403.10228, 2024

work page arXiv 2024
[47]

Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,

L. Qian, J. Li, Y . Wu, Y . Ye, H. Fei, T.-S. Chua, Y . Zhuang, and S. Tang, “Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,”arXiv preprint arXiv:2402.11435, 2024

work page arXiv 2024
[48]

Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,

H. Wang, Z. Xu, Y . Cheng, S. Diao, Y . Zhou, Y . Cao, Q. Wang, W. Ge, and L. Huang, “Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,”arXiv preprint arXiv:2410.03290, 2024

work page arXiv 2024
[49]

Trace: Temporal grounding video llm via causal event modeling,

Y . Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,”arXiv preprint arXiv:2410.05643, 2024

work page arXiv 2024
[50]

Enrich and detect: Video temporal grounding with multimodal llms,

S. Pramanick, E. Mavroudi, Y . Song, R. Chellappa, L. Torresani, and T. Afouras, “Enrich and detect: Video temporal grounding with multimodal llms,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24297–24308, 2025

2025
[51]

Videochat-flash: Hierarchical compression for long-context video modeling,

X. Li, Y . Wang, J. Yu, X. Zeng, Y . Zhu, H. Huang, J. Gao, K. Li, Y . He, C. Wang,et al., “Videochat-flash: Hierarchical compression for long-context video modeling,”arXiv preprint arXiv:2501.00574, 2024

work page arXiv 2024
[52]

Lita: Language instructed temporal-localization assistant,

D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P. Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal-localization assistant,”arXiv preprint arXiv:2403.19046, 2024. 12

work page arXiv 2024
[53]

LLaVA-OneVision: Easy Visual Task Transfer

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu,et al., “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Llava-next: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024

2024
[55]

Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

P. Zhang, X. Dong, Y . Zang, Y . Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang, et al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long- contextual input and output,”arXiv preprint arXiv:2407.03320, 2024

work page arXiv 2024
[56]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He,et al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann,et al., “Phi-4 technical report,”arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,

B. He, H. Li, Y . K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13504– 13514, 2024

2024
[60]

Matryoshka multimodal models,

M. Cai, J. Yang, J. Gao, and Y . J. Lee, “Matryoshka multimodal models,”arXiv preprint arXiv:2405.17430, 2024

work page arXiv 2024
[61]

Cider: Consensus-based image description evaluation,

R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015

2015
[62]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002

2002
[63]

Rouge: A package for automatic evaluation of summaries,

C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, pp. 74–81, 2004

2004
[64]

Sentence-bert: Sentence embeddings using siamese bert- networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert- networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019

2019
[65]

Qwen3-VL Technical Report

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge,et al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao,et al., “Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

Tempcompass: Do video llms really understand videos?,

Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,”arXiv preprint arXiv:2403.00476, 2024

work page arXiv 2024
[68]

Defining and characterizing reward gaming,

J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward gaming,”Advances in Neural Information Processing Systems, vol. 35, pp. 9460–9471, 2022

2022
[69]

Chain-of- thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou,et al., “Chain-of- thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022. 13 Appendix In this section, we provide details that are not included in the main manuscript due to the page limit. ...

2022