pith. sign in

arxiv: 2606.27922 · v3 · pith:BVUF3E22new · submitted 2026-06-26 · 💻 cs.CV · cs.AI

Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding

Pith reviewed 2026-07-01 06:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video understandingself-correctionevidence-driven reflectionhallucination mitigationreinforcement learningmultimodal modelsVideoMMELongVideoBench
0
0 comments X

The pith

A three-stage evidence-driven pipeline with decoupled reinforcement learning enables reliable self-correction for long video understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Reflect-R1 to solve the issue of multimodal models remaining trapped in blind confidence when reflecting on long videos because they lack objective external checks. It builds a pipeline that first forms an intuition, then verifies it by dynamically pulling visual evidence from the video, and finally arbitrates by running multiple temporal searches to settle conflicts. A stage-decoupled reinforcement learning method called SD-GRPO is introduced to train the stages independently, and a new 120K-sample dataset fills the data gap for this process. If correct, this produces higher genuine rectification rates and state-of-the-art results on VideoMME and LongVideoBench by grounding corrections strictly in retrieved evidence rather than internal parameters.

Core claim

Reflect-R1 constructs a three-stage pipeline consisting of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously executing multiple temporal searches to resolve conflicts, it completely breaks the hallucination loop. To overcome policy coupling, a stage-decoupled reinforcement learning algorithm named SD-GRPO independently computes advantage functions across different reasoning stages. A dataset of 120K samples is constructed to bridge the training data gap, and experiments show state-of-the-art performance on VideoMME and LongVideoBench with improved genuine rectification rates.

What carries the argument

The three-stage pipeline of intuition, verification, and arbitration that uses dynamic retrieval of objective visual evidence for verification and multiple temporal searches for conflict resolution, trained via the stage-decoupled SD-GRPO algorithm.

If this is right

  • Genuine rectification rates increase because corrections are now strictly grounded in objective visual evidence rather than closed-loop internal reflection.
  • Policy coupling is avoided in multi-stage reflection pipelines because advantage functions are computed independently per stage.
  • Performance reaches state-of-the-art levels on long-video benchmarks such as VideoMME and LongVideoBench.
  • Self-correction becomes authentic and no longer relies on models simply confirming their own prior outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evidence-retrieval approach could be tested on other multimodal tasks such as long-document or audio understanding where internal reflection also risks hallucination loops.
  • Selection bias in the temporal search step may still appear in domains where relevant evidence is sparse or ambiguously timed.
  • Scaling the 120K dataset or combining SD-GRPO with other reinforcement methods could be explored to further reduce training data requirements.

Load-bearing premise

Dynamically retrieving objective visual evidence can reliably verify initial intuitions and autonomously resolve conflicts via multiple temporal searches without introducing new errors or selection biases.

What would settle it

A controlled test on videos where the initial model output is wrong but the evidence retrieval step either misses key frames or returns frames that support the wrong answer, after which the final arbitration output remains incorrect.

Figures

Figures reproduced from arXiv: 2606.27922 by Bin Xia, Fei Ma, Jiaran Li, Shengqian Qin, Shuimu Chen, Wenming Yang, Yuanshen Guan, Yuteng Chen, Zebang Cheng, Zeyu Zhang.

Figure 1
Figure 1. Figure 1: Comparison between Internal Closed-Loop Reflection and Evidence [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The Failure of Internal Reflection. Without objective external visual an￾chors, static closed-loop reflection tends to amplify initial visual hallucinations. of +2.82% on LongVideoBench and +1.41% on VideoMME. These results con￾firm that grounding self-correction in objective evidence effectively breaks the hallucination loop in long video understanding. In summary, our main contributions are as follows: –… view at source ↗
Figure 3
Figure 3. Figure 3: Decoupled training prevents policy coupling. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of SD-GRPO. We employ a progressive two-stage optimization: Stage I warms up the arbitration policy via strong KL constraints, and Stage II per￾forms full-chain joint optimization. Additionally, group-wise advantage decoupling en￾sures that each reasoning stage evolves independently. corresponds to the generation processes for y1, y2, and y3 respectively. Accord￾ingly, x (k) represents the stage-s… view at source ↗
Figure 5
Figure 5. Figure 5: Training Dynamics of Stage-Decoupled Verification. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative example of the reflective reasoning process. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A disconnect between reasoning and response occurs during the blind verifica￾tion stage (S2) due to the absence of an abstention incentive. The model accurately identifies the lack of visual evidence during its reasoning process but is forced to make an ungrounded guess in the final output. C Dataset Construction Details [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Training data distribution across different source datasets and video durations. E Prompt Design This section details the specific prompts utilized across the system level and the three distinct reasoning stages. These prompts rigorously regulate the output for￾matting, tool invocation protocols, and the abstention mechanism of the model. This section also includes the reflection prompt used by the interna… view at source ↗
Figure 9
Figure 9. Figure 9: The system prompt used to define the model persona and tool signatures. You must conduct reasoning inside <think> and </think> first every time you call the tool or answer the question. You must invoke tools to explore any video content you are interested in within <tool_call> </tool_call> tags. You cannot answer the question directly. You are required to invoke the tool at least once to explore the video … view at source ↗
Figure 10
Figure 10. Figure 10: The intuition prompt (S1) requiring active tool exploration [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The verification prompt (S2) incorporating the abstention mechanism. The interaction history above presents two distinct analysis phases: - Phase 1 (S1): Initial Answer. (Derived from the full video and tool-retrieved frames.) - Phase 2 (S2): Blind Verification (based solely on retrieved frames from Initial Answer). Your task is to arbitrate and determine the correct answer. You must actively retrieve vid… view at source ↗
Figure 12
Figure 12. Figure 12: The arbitration prompt (S3) designed for conflict resolution. Review your previous answer critically. Analyze the evidence and logic behind your initial conclusion. You might have hallucinated or missed details in the video. You must enclose your self-reflection process within <think> and </think> tags, and your final definitive answer within <answer> and </answer> tags. Your output must strictly follow t… view at source ↗
Figure 13
Figure 13. Figure 13: Reflection prompt used by the internal closed-loop reflection baseline. F More Case Studies To further illustrate the autonomous self-correction capabilities elicited by SD￾GRPO, we present additional qualitative cases in this section. F.1 Failure Modes of Internal Closed-Loop Reflection We first present the failure modes of internal closed-loop reflection to further clarify the empirical observations dis… view at source ↗
Figure 14
Figure 14. Figure 14: Repetition of Initial Errors. During internal reflection, the model fails to generate new insights and simply repeats the erroneous reasoning of its initial intuition, trapping itself in a hallucination loop [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Overturning a Correct Initial Answer during Internal Reflection. [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Evidence-Grounded Hallucination Rectification. [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Abstention and Re-retrieval under Missing Evidence. [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Global Reconstruction after Local Reasoning Failure. [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Complete Failure. Both the intuition and verification stages arrive at the same incorrect conclusion. Lacking a conflict signal, the arbitrator adopts the shared error instead of initiating further exploration [PITH_FULL_IMAGE:figures/full_fig_p033_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Arbitration Failure. The verification stage successfully identifies the correct answer, but the arbitration stage erroneously reverts to the flawed initial intuition [PITH_FULL_IMAGE:figures/full_fig_p034_20.png] view at source ↗
read the original abstract

Current multimodal reflection mechanisms for long video understanding predominantly rely on closed-loop self-reflection within internal parameters. Lacking objective external evidence, models are frequently trapped in blind confidence and often fail to correct errors. Furthermore, applying reinforcement learning to multi-stage reflection pipelines introduces severe policy coupling, which is exacerbated by a critical scarcity of dedicated training data. To address these limitations, this work proposes Reflect-R1, the first Evidence-Driven self-correction framework for long video understanding. The framework constructs a three-stage pipeline consisting of intuition, verification, and arbitration. By dynamically retrieving objective visual evidence to verify initial intuitions and autonomously executing multiple temporal searches to resolve conflicts, it completely breaks the hallucination loop. To overcome policy coupling, we design a stage-decoupled reinforcement learning algorithm named SD-GRPO that independently computes advantage functions across different reasoning stages. Concurrently, we construct a dataset of 120K samples to bridge the training data gap. Extensive experiments on benchmarks such as VideoMME and LongVideoBench demonstrate that Reflect-R1 achieves state-of-the-art performance. Our method significantly improves the genuine rectification rate and enables authentic self-correction strictly grounded in objective evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Reflect-R1, a three-stage (intuition, verification, arbitration) evidence-driven self-correction framework for long video understanding. It dynamically retrieves objective visual evidence to verify initial intuitions and resolve conflicts via multiple temporal searches, claims to completely break hallucination loops, introduces the SD-GRPO stage-decoupled RL algorithm to address policy coupling, and releases a 120K-sample training dataset. Experiments report SOTA results on VideoMME and LongVideoBench with improved genuine rectification rates.

Significance. If the core claims hold, the work would be significant for multimodal video reasoning by replacing closed-loop internal reflection with externally grounded verification, addressing a documented failure mode in current MLLMs. The stage-decoupled RL formulation and dedicated dataset construction are practical contributions that could generalize beyond this pipeline.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (pipeline description): the assertion that the method 'completely breaks the hallucination loop' is load-bearing yet rests on the unverified premise that dynamic evidence retrieval and autonomous temporal searches introduce neither selection bias nor new errors. No mechanism is described for detecting or mitigating retrieval-induced hallucinations in the arbitration stage.
  2. [§4] §4 (SD-GRPO): while the algorithm decouples advantage estimation across stages during training, it leaves inference-time retrieval behavior unchanged; thus the arbitration stage remains exposed to any biases present in the verification-stage evidence, undermining the claim of autonomous conflict resolution.
  3. [Experimental section] Experimental section (VideoMME / LongVideoBench results): the reported SOTA gains and 'genuine rectification rate' improvements lack an ablation isolating the contribution of evidence retrieval quality versus the three-stage structure; without this, it is unclear whether the performance stems from the proposed mechanism or from stronger retrieval alone.
minor comments (2)
  1. [Dataset section] The 120K dataset construction details (filtering criteria, temporal sampling strategy, and human verification protocol) are only sketched; expanding this would strengthen reproducibility claims.
  2. [§4] Notation for the advantage function in SD-GRPO could be clarified with an explicit equation showing how stage-specific advantages are computed and combined.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our framework. We address each major comment point-by-point below.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (pipeline description): the assertion that the method 'completely breaks the hallucination loop' is load-bearing yet rests on the unverified premise that dynamic evidence retrieval and autonomous temporal searches introduce neither selection bias nor new errors. No mechanism is described for detecting or mitigating retrieval-induced hallucinations in the arbitration stage.

    Authors: We agree the phrasing 'completely breaks' is strong and that the manuscript does not explicitly describe a dedicated mechanism for detecting retrieval-induced errors in arbitration. The arbitration stage relies on multiple independent temporal searches to cross-verify evidence and resolve conflicts, which grounds decisions in objective visual data rather than internal parameters. To address the concern, we will revise the abstract and §3 to replace 'completely breaks' with 'substantially mitigates' and add a limitations paragraph discussing potential retrieval biases. revision: yes

  2. Referee: [§4] §4 (SD-GRPO): while the algorithm decouples advantage estimation across stages during training, it leaves inference-time retrieval behavior unchanged; thus the arbitration stage remains exposed to any biases present in the verification-stage evidence, undermining the claim of autonomous conflict resolution.

    Authors: SD-GRPO targets policy coupling exclusively during RL training by decoupling advantage estimation per stage. The inference pipeline remains the three-stage structure, where arbitration performs autonomous multiple temporal searches to resolve conflicts between intuition and verification outputs. This design ensures conflict resolution occurs at inference independently of training dynamics, preserving the external grounding claim. revision: no

  3. Referee: [Experimental section] Experimental section (VideoMME / LongVideoBench results): the reported SOTA gains and 'genuine rectification rate' improvements lack an ablation isolating the contribution of evidence retrieval quality versus the three-stage structure; without this, it is unclear whether the performance stems from the proposed mechanism or from stronger retrieval alone.

    Authors: We will add a new ablation in the experimental section comparing the full Reflect-R1 pipeline against a retrieval-only baseline (without the staged intuition-verification-arbitration structure) on both benchmarks to isolate the contribution of the three-stage mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: framework relies on external evidence and empirical validation

full rationale

The paper proposes a three-stage pipeline (intuition, verification, arbitration) that uses dynamic external visual evidence retrieval and a stage-decoupled RL algorithm (SD-GRPO), plus a constructed 120K-sample dataset. Performance is evaluated on external benchmarks (VideoMME, LongVideoBench). No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or derivations that reduce to inputs by construction appear in the provided text. The central claim of breaking the hallucination loop is presented as a design outcome grounded in objective evidence retrieval, not an internal self-referential loop or ansatz smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no concrete free parameters, axioms, or invented entities are specified beyond the high-level framework description.

pith-pipeline@v0.9.1-grok · 5763 in / 1028 out tokens · 47961 ms · 2026-07-01T06:45:32.316569+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 20 canonical work pages · 11 internal anchors

  1. [1]

    arXiv preprint arXiv:2103.06224 (2021)

    Arumugam, D., Henderson, P., Bacon, P.L.: An information-theoretic perspective on credit assignment in reinforcement learning. arXiv preprint arXiv:2103.06224 (2021)

  2. [2]

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025)

  3. [3]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, T.S., Siarohin, A., Menapace, W., Deyneka, E., Chao, H.w., Jeon, B.E., Fang, Y., Lee, H.Y., Ren, J., Yang, M.H., et al.: Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13320–13331 (2024)

  4. [4]

    arXiv preprint arXiv:2504.15275 (2025)

    Cheng, J., Xiong, G., Qiao, R., Li, L., Guo, C., Wang, J., Lv, Y., Wang, F.Y.: Stop summation: Min-form credit assignment is all process reward model needs for reasoning. arXiv preprint arXiv:2504.15275 (2025)

  5. [5]

    In: The Fourteenth International Conference on Learning Repre- sentations (2026)

    Chenzhaoyu, Lin, H., Nie, Y., Ma, F., Xu, X., Yu, F., Long, C.: Invert4TVG: A temporal video grounding framework with inversion tasks preserving action under- standing ability. In: The Fourteenth International Conference on Learning Repre- sentations (2026)

  6. [6]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)

  7. [7]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)

  8. [8]

    In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

    Godin, F., Kumar, A., Mittal, A.: Learning when not to answer: a ternary reward structure for reinforcement learning based question answering. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu- tational Linguistics: Human Language Technologies, Volume 2 (Industry Papers). pp. 122–129 (2019)

  9. [9]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

    Guo, C., He, Y., Nie, Y., Ma, F., Xu, X., Long, C.: T2sgrid: Temporal-to-spatial gridification for video temporal grounding. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 3443–3454 (June 2026)

  10. [10]

    Nature645(8081), 633–638 (2025)

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645(8081), 633–638 (2025)

  11. [11]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, B., Li, H., Jang, Y.K., Jia, M., Cao, X., Shah, A., Shrivastava, A., Lim, S.N.: Ma-lmm: Memory-augmented large multimodal model for long-term video under- standing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13504–13514 (2024)

  12. [12]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., Lin, S.: Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749 (2025)

  13. [13]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) Reflect-R1 17

  14. [14]

    OpenAI o1 System Card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al.: Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

  15. [15]

    Advances in neural information processing systems35, 22199–22213 (2022)

    Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems35, 22199–22213 (2022)

  16. [16]

    arXiv preprint arXiv:2508.03100 (2025)

    Kulkarni, Y., Fazli, P.: Avatar: Reinforcement learning to see, hear, and reason over video. arXiv preprint arXiv:2508.03100 (2025)

  17. [17]

    In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Liu, S., Li, J., Zhao, G., Zhang, Y., Meng, X., Yu, F.R., Ji, X., Li, M.: Eventgpt: Event stream understanding with multimodal large language models. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 29139–29149 (June 2025)

  18. [18]

    Sequence Modeling of Temporal Credit Assignment for Episodic Reinforcement Learning

    Liu, Y., Luo, Y., Zhong, Y., Chen, X., Liu, Q., Peng, J.: Sequence modeling of temporal credit assignment for episodic reinforcement learning. arXiv preprint arXiv:1905.13420 (2019)

  19. [19]

    Advances in neural information processing systems36, 46534–46594 (2023)

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al.: Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems36, 46534–46594 (2023)

  20. [20]

    In: Proceedings of the 31st International Conference on Computational Linguistics

    Madhusudhan, N., Madhusudhan, S.T., Yadav, V., Hashemi, M.: Do llms know when to not answer? investigating abstention abilities of large language models. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 9329–9345 (2025)

  21. [21]

    arXiv preprint arXiv:2502.16863 (2025)

    Nagpal, K., Dong, D., Bouvier, J.B., Mehr, N.: Leveraging large language mod- els for effective and explainable multi-agent credit assignment. arXiv preprint arXiv:2502.16863 (2025)

  22. [22]

    arXiv preprint arXiv:2511.05489 (2025)

    Pan, J., Zhang, Q., Zhang, R., Lu, M., Wan, X., Zhang, Y., Liu, C., She, Q.: Timesearch-r: Adaptive temporal search for long-form video understanding via self-verification reinforcement learning. arXiv preprint arXiv:2511.05489 (2025)

  23. [23]

    arXiv preprint arXiv:2601.06224 (2026)

    Pan, M., Gan, W., Chen, J., Zhang, W., Sun, B., Yin, J., Zhang, X.: Ground what you see: Hallucination-resistant mllms via caption feedback, diversity-aware sampling, and conflict regularization. arXiv preprint arXiv:2601.06224 (2026)

  24. [24]

    Advances in Neural Information Processing Systems36, 42748–42761 (2023)

    Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.: Perception test: A di- agnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems36, 42748–42761 (2023)

  25. [25]

    In: 2025 IEEE International Con- ference on Multimedia and Expo (ICME)

    Pereira, J., Lopes, V., Semedo, D., Neves, J.: Self-res: Self-reflection in large vision- language models for long video understanding. In: 2025 IEEE International Con- ference on Multimedia and Expo (ICME). pp. 1–9. IEEE (2025)

  26. [26]

    Pignatelli, E., Ferret, J., Geist, M., Mesnard, T., van Hasselt, H., Pietquin, O., Toni, L.: A survey of temporal credit assignment in deep reinforcement learning (2024)

  27. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Shi, Y., Yan, W., Xu, G., Li, Y., Chen, Y., Li, Z., Yu, F., Li, M., Yeo, S.Y.: Pvchat: Personalized video chat with one-shot learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 23321–23331 (October 2025) 18 S. Chen et al

  29. [29]

    Advances in neural information processing systems36, 8634–8652 (2023)

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Lan- guage agents with verbal reinforcement learning. Advances in neural information processing systems36, 8634–8652 (2023)

  30. [30]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Tang, X., Qiu, J., Xie, L., Tian, Y., Jiao, J., Ye, Q.: Adaptive keyframe sampling for long video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29118–29128 (2025)

  31. [31]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vin- cent,D.,Pan,Z.,Wang,S.,etal.:Gemini1.5:Unlockingmultimodalunderstanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  32. [32]

    arXiv preprint arXiv:2404.10960 (2024)

    Tomani, C., Chaudhuri, K., Evtimov, I., Cremers, D., Ibrahim, M.: Uncertainty- based abstention in llms improves safety and reduces hallucinations. arXiv preprint arXiv:2404.10960 (2024)

  33. [33]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Wang, H., Qu, C., Huang, Z., Chu, W., Lin, F., Chen, W.: Vl-rethinker: Incen- tivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837 (2025)

  34. [34]

    arXiv preprint arXiv:2510.01132 (2025)

    Wang, R., Ammanabrolu, P.: A practitioner’s guide to multi-turn agentic reinforce- ment learning. arXiv preprint arXiv:2510.01132 (2025)

  35. [35]

    In: European Conference on Computer Vision

    Wang, X., Zhang, Y., Zohar, O., Yeung-Levy, S.: Videoagent: Long-form video understanding with large language model as agent. In: European Conference on Computer Vision. pp. 58–76. Springer (2024)

  36. [36]

    In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence

    Wang, Z., Yu, S., Stengel-Eskin, E., Yoon, J., Cheng, F., Bertasius, G., Bansal, M.: Videotree: Adaptive tree-based video representation for llm reasoning on long videos. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence. pp. 3272–3283 (2025)

  37. [37]

    Advances in neural information processing systems35, 24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022)

  38. [38]

    TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

    Wei, Z., Yang, X., Sun, K., Wang, J., Shao, R., Chen, S., Kachuee, M., Gollapudi, T., Liao, T., Scheffer, N., et al.: Truthrl: Incentivizing truthful llms via reinforce- ment learning. arXiv preprint arXiv:2509.25760 (2025)

  39. [39]

    arXiv preprint arXiv:2405.09711 (2024)

    Wu,B.,Yu,S.,Chen,Z.,Tenenbaum,J.B.,Gan,C.:Star:Abenchmarkforsituated reasoning in real-world videos. arXiv preprint arXiv:2405.09711 (2024)

  40. [40]

    Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024)

    Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Pro- cessing Systems37, 28828–28857 (2024)

  41. [41]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Xiao,J.,Shang,X.,Yao,A.,Chua,T.S.:Next-qa:Nextphaseofquestion-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9777–9786 (2021)

  42. [42]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizingreasoningandactinginlanguagemodels.In:Theeleventhinternational conference on learning representations (2022)

  43. [43]

    In: CVPR

    Ye, J., Wang, Z., Sun, H., Chandrasegaran, K., Durante, Z., Eyzaguirre, C., Bisk, Y., Niebles, J.C., Adeli, E., Fei-Fei, L., et al.: Re-thinking temporal search for long-form video understanding. In: CVPR. pp. 8579–8591 (2025)

  44. [44]

    CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019)

  45. [45]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Zhang, Y., Liu, X., Tao, R., Chen, Q., Fei, H., Che, W., Qin, L.: Vitcot: Video-text interleaved chain-of-thought for boosting video understanding in large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. Reflect-R1 19 p. 5267–5276. MM ’25, Association for Computing Machinery, New York, NY, USA (2025)

  46. [46]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

  47. [47]

    Zhao, F., Tan, S., Qiu, X., Xun, L., Jiang, W., Zheng, J., Fan, H., Gao, J., Yan, D., Li, M.: Favchat: Hierarchical prompt-query guided facial video understanding with data-efficient grpo (2026)

  48. [48]

    type": "function

    Zhou, J., Shu, Y., Zhao, B., Wu, B., Liang, Z., Xiao, S., Qin, M., Yang, X., Xiong, Y., Zhang, B., et al.: Mlvu: Benchmarking multi-task long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13691–13701 (2025) 20 S. Chen et al. Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long...