Recognition: unknown
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3
The pith
Two self-evolving agents learn video temporal grounding from unlabeled videos alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvoGround shows that a proposer agent generating pseudo query-moment pairs and a solver agent learning to localize them can iteratively improve each other through reinforcement feedback, achieving supervised-level temporal grounding performance and strong fine-grained captioning when trained only on 2.5K raw videos.
What carries the argument
The mutual reinforcement loop in which the proposer generates query-moment pairs from raw video and the solver returns grounding signals that update the proposer.
If this is right
- The system matches or surpasses fully supervised models on multiple VTG benchmarks after training on only 2.5K unlabeled videos.
- It produces state-of-the-art fine-grained video captions without any manual labels.
- Both agents improve across successive iterations of the self-reinforcing loop.
Where Pith is reading between the lines
- The same loop could be scaled to much larger unlabeled video collections to further close or exceed the gap to supervised methods.
- The approach may transfer to other video-language tasks that currently depend on expensive temporal annotations.
- Because no external reward model is used, the framework's success depends entirely on the internal consistency of the generated pairs and grounding signals.
Load-bearing premise
The mutual reinforcement loop between proposer and solver can bootstrap effective temporal grounding and captioning capabilities starting from raw videos and a shared backbone without any initial human supervision or external reward signals.
What would settle it
Run the full training loop on the same 2.5K videos and measure grounding accuracy on a held-out benchmark; if performance stays at random baseline levels with no measurable improvement across iterations, the bootstrapping claim is false.
Figures
read the original abstract
Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query--moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EvoGround, a framework consisting of a proposer agent that generates query-moment pairs from raw videos and a solver agent that performs temporal grounding, coupled in a self-reinforcing RL loop initialized from the same backbone. Trained on 2.5K unlabeled videos, it claims to match or surpass fully supervised models on multiple VTG benchmarks while also emerging as a state-of-the-art fine-grained video captioner without any manual labels.
Significance. If the central performance claims hold under rigorous validation, the work would represent a notable advance in unsupervised video temporal grounding by demonstrating that mutual agent improvement can reduce dependence on large annotated datasets. The self-evolving loop is a conceptually interesting direction, though its effectiveness hinges on unverified dynamics that could either bootstrap genuine capabilities or reinforce degenerate alignments.
major comments (2)
- Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.
- The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.
minor comments (2)
- Abstract: Specify the exact benchmarks used for VTG evaluation and the quantitative margins by which EvoGround matches or surpasses supervised baselines.
- Abstract: Clarify whether the 2.5K videos are drawn from a single source or multiple datasets, as this affects reproducibility of the unsupervised setting.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that the abstract requires expansion to better substantiate the central claims about the self-reinforcing loop. We address each point below and will incorporate revisions accordingly.
read point-by-point responses
-
Referee: Abstract: The claim that the proposer-solver loop 'mutually improve across iterations' and matches supervised performance is load-bearing for the central contribution, yet the abstract provides no details on reward formulation, self-consistency metrics, or training dynamics; without these, it is impossible to evaluate whether the loop avoids the risk of converging to consistent but inaccurate pseudo-labels from early low-quality proposals.
Authors: We agree the abstract is overly concise and omits key elements of the reward formulation and dynamics. The full manuscript details the reward as a weighted sum of solver matching scores and reconstruction consistency losses (Section 3.2), with training proceeding over 5 iterations on the 2.5K videos. We will revise the abstract to include a brief clause on these signals and the iterative improvement process, allowing readers to assess stability without relying solely on the main text. revision: yes
-
Referee: The description of the self-reinforcing reinforcement-learning loop (abstract): the feedback signals from solver to proposer are stated to derive from internal matching scores or reconstruction losses, but no mechanism is specified to escape potential stable but degenerate equilibria, which directly undermines the assumption that capabilities emerge purely from raw videos without external anchors.
Authors: The manuscript specifies the feedback via internal matching scores and reconstruction losses, with the RL objective including an entropy regularization term and a proposal diversity penalty (Equation 4 in Section 3.3) to discourage collapse. We acknowledge the abstract does not mention these safeguards. We will revise the abstract to note the presence of regularization that promotes exploration and avoids degenerate equilibria. This addresses the concern while preserving the unsupervised framing. revision: partial
Circularity Check
No circularity detected in the derivation chain
full rationale
The paper presents a self-reinforcing RL loop between proposer and solver agents initialized from a shared backbone and trained on 2.5K unlabeled videos. The abstract and provided text describe mutual improvement via internal feedback signals without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to inputs by construction. No load-bearing step is shown to be equivalent to its own data or prior results via the enumerated patterns; the emergence of grounding and captioning capabilities is asserted as a consequence of the loop dynamics rather than a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The self-reinforcing RL loop produces net improvement in grounding accuracy without external labels or supervision
Reference graph
Works this paper leans on
-
[1]
Modal-specific pseudo query generation for video corpus moment retrieval,
M. Jung, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Modal-specific pseudo query generation for video corpus moment retrieval,” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7769–7781, Dec. 2022
2022
-
[2]
Detecting moments and highlights in videos via natural language queries,
J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,”Advances in Neural Information Processing Systems, vol. 34, pp. 11846– 11858, 2021
2021
-
[3]
Can i trust your answer? visually grounded video question answering,
J. Xiao, A. Yao, Y . Li, and T.-S. Chua, “Can i trust your answer? visually grounded video question answering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13204–13214, 2024
2024
-
[4]
Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,
Z. Shangguan, C. Li, Y . Ding, Y . Zheng, Y . Zhao, T. Fitzgerald, and A. Cohan, “Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models,”arXiv preprint arXiv:2410.23266, 2024
-
[5]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhang,et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,”arXiv preprint arXiv:2405.21075, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Egoexo-con: Exploring view-invariant video temporal understanding,
M. Jung, J. Xiao, J. Kim, B.-T. Zhang, and A. Yao, “Egoexo-con: Exploring view-invariant video temporal understanding,”arXiv preprint arXiv:2510.26113, 2025
-
[7]
Tall: Temporal activity localization via language query,
J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity localization via language query,” inProceedings of the IEEE international conference on computer vision, pp. 5267–5275, 2017
2017
-
[8]
Dense regression network for video grounding,
R. Zeng, H. Xu, W. Huang, P. Chen, M. Tan, and C. Gan, “Dense regression network for video grounding,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10287–10296, 2020
2020
-
[9]
Exploiting auxiliary caption for video grounding,
H. Li, M. Cao, X. Cheng, Y . Li, Z. Zhu, and Y . Zou, “Exploiting auxiliary caption for video grounding,” inProceedings of the AAAI conference on artificial intelligence, 2024
2024
-
[10]
A closer look at temporal sentence grounding in videos: Dataset and metric,
Y . Yuan, X. Lan, X. Wang, L. Chen, Z. Wang, and W. Zhu, “A closer look at temporal sentence grounding in videos: Dataset and metric,” inProceedings of the 2nd international workshop on human-centric multimedia analysis, pp. 13–21, 2021
2021
-
[11]
Uncovering hidden challenges in query- based video moment retrieval,
M. Otani, Y . Nakashima, E. Rahtu, and J. Heikkilä, “Uncovering hidden challenges in query- based video moment retrieval,”arXiv preprint arXiv:2009.00325, 2020
-
[12]
On the consistency of video large language models in temporal comprehension,
M. Jung, J. Xiao, B.-T. Zhang, and A. Yao, “On the consistency of video large language models in temporal comprehension,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 13713–13722, 2025
2025
-
[13]
Dense-captioning events in videos,
R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-captioning events in videos,” inInternational Conference on Computer Vision (ICCV), 2017
2017
-
[14]
Time-r1: Post-training large vision language model for temporal video grounding,
Y . Wang, Z. Wang, B. Xu, Y . Du, K. Lin, Z. Xiao, Z. Yue, J. Ju, L. Zhang, D. Yang,et al., “Time-r1: Post-training large vision language model for temporal video grounding,”arXiv preprint arXiv:2503.13377, 2025
-
[15]
Rextime: A benchmark suite for reasoning-across-time in videos,
J.-J. Chen, Y .-C. Liao, H.-C. Lin, Y .-C. Yu, Y .-C. Chen, and F. Wang, “Rextime: A benchmark suite for reasoning-across-time in videos,”Advances in Neural Information Processing Systems, vol. 37, pp. 28662–28673, 2024
2024
-
[16]
Et bench: Towards open-ended event- level video-language understanding,
Y . Liu, Z. Ma, Z. Qi, Y . Wu, Y . Shan, and C. W. Chen, “Et bench: Towards open-ended event- level video-language understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 32076–32110, 2024
2024
-
[17]
Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,
M. Cai, R. Tan, J. Zhang, B. Zou, K. Zhang, F. Yao, F. Zhu, J. Gu, Y . Zhong, Y . Shang,et al., “Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models,”arXiv preprint arXiv:2410.10818, 2024. 10
-
[18]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervi- sion,” inInternational conference on machine learning, pp. 8748–8763, PmLR, 2021
2021
-
[19]
Learning 2d temporal adjacent networks for moment local- ization with natural language,
S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2d temporal adjacent networks for moment local- ization with natural language,” inProceedings of the AAAI Conference on Artificial Intelligence, pp. 12870–12877, 2020
2020
-
[20]
Span-based localizing network for natural language video localization,
H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizing network for natural language video localization,”arXiv preprint arXiv:2004.13931, 2020
-
[21]
Local-global video-text interactions for temporal grounding,
J. Mun, M. Cho, and B. Han, “Local-global video-text interactions for temporal grounding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819, 2020
2020
-
[22]
Background-aware moment detection for video moment retrieval,
M. Jung, Y . Jang, S. Choi, J. Kim, J.-H. Kim, and B.-T. Zhang, “Background-aware moment detection for video moment retrieval,” inProceedings of the Winter Conference on Applications of Computer Vision, pp. 8575–8585, February 2025
2025
-
[23]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,
W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez,et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023), 2023
2023
-
[24]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar,et al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Video-chatgpt: Towards detailed video understanding via large vision and language models,
M. Maaz, H. Rasheed, S. Khan, and F. S. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024
2024
-
[26]
VideoChat: Chat-Centric Video Understanding
K. Li, Y . He, Y . Wang, Y . Li, W. Wang, P. Luo, Y . Wang, L. Wang, and Y . Qiao, “Videochat: Chat-centric video understanding,”arXiv preprint arXiv:2305.06355, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Moviechat: From dense token to sparse memory for long video understanding,
E. Song, W. Chai, G. Wang, Y . Zhang, H. Zhou, F. Wu, H. Chi, X. Guo, T. Ye, Y . Zhang,et al., “Moviechat: From dense token to sparse memory for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18221–18232, 2024
2024
-
[28]
Vtimellm: Empower llm to grasp video moments,
B. Huang, X. Wang, H. Chen, Z. Song, and W. Zhu, “Vtimellm: Empower llm to grasp video moments,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14271–14280, 2024
2024
-
[29]
Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,
Y . Guo, J. Liu, M. Li, X. Tang, X. Chen, and B. Zhao, “Vtg-llm: Integrating timestamp knowl- edge into video llms for enhanced video temporal grounding,”arXiv preprint arXiv:2405.13382, 2024
-
[30]
Timechat: A time-sensitive multimodal large language model for long video understanding,
S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323, 2024
2024
-
[31]
Timesuite: Improving mllms for long video understanding via grounded tuning,
X. Zeng, K. Li, C. Wang, X. Li, T. Jiang, Z. Yan, S. Li, Y . Shi, Z. Yue, Y . Wang,et al., “Timesuite: Improving mllms for long video understanding via grounded tuning,”arXiv preprint arXiv:2410.19702, 2024
-
[32]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi,et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Video-R1: Reinforcing Video Reasoning in MLLMs
K. Feng, K. Gong, B. Li, Z. Guo, Y . Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue, “Video-r1: Reinforcing video reasoning in mllms,”arXiv preprint arXiv:2503.21776, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,
X. Li, Z. Yan, D. Meng, L. Dong, X. Zeng, Y . He, Y . Wang, Y . Qiao, Y . Wang, and L. Wang, “Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning,”arXiv preprint arXiv:2504.06958, 2025. 11
-
[35]
Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,
Z. Yan, X. Li, Y . He, Z. Yue, X. Zeng, Y . Wang, Y . Qiao, L. Wang, and Y . Wang, “Videochat-r1. 5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception,”arXiv preprint arXiv:2509.21100, 2025
-
[36]
Qwen2. 5 technical report,
A. Y . Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei,et al., “Qwen2. 5 technical report,”arXiv preprint, 2024
2024
-
[37]
Dr. zero: Self- evolving search agents without training data,
Z. Yue, K. Upasani, X. Yang, S. Ge, S. Nie, Y . Mao, Z. Liu, and D. Wang, “Dr. zero: Self- evolving search agents without training data,”arXiv preprint arXiv:2601.07055, 2026
-
[38]
R-Zero: Self-Evolving Reasoning LLM from Zero Data
C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, and D. Yu, “R-zero: Self-evolving reasoning llm from zero data,”arXiv preprint arXiv:2508.05004, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
A. Zhao, Y . Wu, Y . Yue, T. Wu, Q. Xu, M. Lin, S. Wang, Q. Wu, Z. Zheng, and G. Huang, “Absolute zero: Reinforced self-play reasoning with zero data,”arXiv preprint arXiv:2505.03335, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Mm-zero: Self-evolving multi-model vision language models from zero data,
Z. Li, H. Du, C. Huang, X. Wu, L. Yu, Y . He, J. Xie, X. Wu, Z. Liu, J. Zhang,et al., “Mm- zero: Self-evolving multi-model vision language models from zero data,”arXiv preprint arXiv:2603.09206, 2026
-
[41]
Visplay: Self-evolving vision-language models from images,
Y . He, C. Huang, Z. Li, J. Huang, and Y . Yang, “Visplay: Self-evolving vision-language models from images,”arXiv preprint arXiv:2511.15661, 2025
-
[42]
Evolmm: Self-evolving large multimodal models with continuous rewards,
O. Thawakar, S. Venkatraman, R. Thawkar, A. Shaker, H. Cholakkal, R. M. Anwer, S. Khan, and F. Khan, “Evolmm: Self-evolving large multimodal models with continuous rewards,”arXiv preprint arXiv:2511.16672, 2025
-
[43]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang,et al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa,et al., “Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
S.-Y . Liu, X. Dong, X. Lu, S. Diao, P. Belcak, M. Liu, M.-H. Chen, H. Yin, Y .-C. F. Wang, K.-T. Cheng,et al., “Gdpo: Group reward-decoupled normalization policy optimization for multi-reward rl optimization,”arXiv preprint arXiv:2601.05242, 2026
work page internal anchor Pith review arXiv 2026
-
[46]
Hawkeye: Training video-text llms for grounding text in videos,
Y . Wang, X. Meng, J. Liang, Y . Wang, Q. Liu, and D. Zhao, “Hawkeye: Training video-text llms for grounding text in videos,”arXiv preprint arXiv:2403.10228, 2024
-
[47]
Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,
L. Qian, J. Li, Y . Wu, Y . Ye, H. Fei, T.-S. Chua, Y . Zhuang, and S. Tang, “Momentor: Ad- vancing video large language model with fine-grained temporal reasoning,”arXiv preprint arXiv:2402.11435, 2024
-
[48]
Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,
H. Wang, Z. Xu, Y . Cheng, S. Diao, Y . Zhou, Y . Cao, Q. Wang, W. Ge, and L. Huang, “Grounded- videollm: Sharpening fine-grained temporal grounding in video large language models,”arXiv preprint arXiv:2410.03290, 2024
-
[49]
Trace: Temporal grounding video llm via causal event modeling,
Y . Guo, J. Liu, M. Li, Q. Liu, X. Chen, and X. Tang, “Trace: Temporal grounding video llm via causal event modeling,”arXiv preprint arXiv:2410.05643, 2024
-
[50]
Enrich and detect: Video temporal grounding with multimodal llms,
S. Pramanick, E. Mavroudi, Y . Song, R. Chellappa, L. Torresani, and T. Afouras, “Enrich and detect: Video temporal grounding with multimodal llms,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24297–24308, 2025
2025
-
[51]
Videochat-flash: Hierarchical compression for long-context video modeling,
X. Li, Y . Wang, J. Yu, X. Zeng, Y . Zhu, H. Huang, J. Gao, K. Li, Y . He, C. Wang,et al., “Videochat-flash: Hierarchical compression for long-context video modeling,”arXiv preprint arXiv:2501.00574, 2024
-
[52]
Lita: Language instructed temporal-localization assistant,
D.-A. Huang, S. Liao, S. Radhakrishnan, H. Yin, P. Molchanov, Z. Yu, and J. Kautz, “Lita: Language instructed temporal-localization assistant,”arXiv preprint arXiv:2403.19046, 2024. 12
-
[53]
LLaVA-OneVision: Easy Visual Task Transfer
B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y . Li, Z. Liu,et al., “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Llava-next: Improved reasoning, ocr, and world knowledge,
H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024
2024
-
[55]
P. Zhang, X. Dong, Y . Zang, Y . Cao, R. Qian, L. Chen, Q. Guo, H. Duan, B. Wang, L. Ouyang, et al., “Internlm-xcomposer-2.5: A versatile large vision language model supporting long- contextual input and output,”arXiv preprint arXiv:2407.03320, 2024
-
[56]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. He,et al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann,et al., “Phi-4 technical report,”arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,
B. He, H. Li, Y . K. Jang, M. Jia, X. Cao, A. Shah, A. Shrivastava, and S.-N. Lim, “Ma- lmm: Memory-augmented large multimodal model for long-term video understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13504– 13514, 2024
2024
-
[60]
M. Cai, J. Yang, J. Gao, and Y . J. Lee, “Matryoshka multimodal models,”arXiv preprint arXiv:2405.17430, 2024
-
[61]
Cider: Consensus-based image description evaluation,
R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” inProceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015
2015
-
[62]
Bleu: a method for automatic evaluation of machine translation,
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002
2002
-
[63]
Rouge: A package for automatic evaluation of summaries,
C.-Y . Lin, “Rouge: A package for automatic evaluation of summaries,” inText summarization branches out, pp. 74–81, 2004
2004
-
[64]
Sentence-bert: Sentence embeddings using siamese bert- networks,
N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert- networks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pp. 3982–3992, 2019
2019
-
[65]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge,et al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[66]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao,et al., “Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,”arXiv preprint arXiv:2508.18265, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
Tempcompass: Do video llms really understand videos?,
Y . Liu, S. Li, Y . Liu, Y . Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?,”arXiv preprint arXiv:2403.00476, 2024
-
[68]
Defining and characterizing reward gaming,
J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward gaming,”Advances in Neural Information Processing Systems, vol. 35, pp. 9460–9471, 2022
2022
-
[69]
Chain-of- thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou,et al., “Chain-of- thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022. 13 Appendix In this section, we provide details that are not included in the main manuscript due to the page limit. ...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.