VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Arman Cohan; Lin Fu; Tingyu Song; Yang Wang; Yilun Zhao; Zheyuan Yang

arxiv: 2606.05259 · v1 · pith:PNLHBRL5new · submitted 2026-06-03 · 💻 cs.CV

Pith reviewed 2026-06-28 06:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords video reasoning · knowledge-intensive understanding · dataset construction · chain-of-thought rationales · benchmark evaluation · post-training · human-in-the-loop generation

The pith

A new dataset of 315K video reasoning examples improves models on knowledge-intensive tasks while staying competitive on general benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoKR as a training corpus built from 145K videos and 315K examples to target deeper knowledge and reasoning in video understanding. It uses a human-in-the-loop pipeline to generate chain-of-thought rationales that emphasize progressive skill development and reliability. A companion benchmark, VideoKR-Eval, is designed so that questions cannot be solved through text alone. Experiments under standard post-training show gains on knowledge-heavy video reasoning without loss on general tasks, pointing to data construction as the main lever for progress.

Core claim

Post-training models on VideoKR produces better results on knowledge-intensive video reasoning than prior post-training methods while remaining competitive on general video reasoning, which the authors attribute to the design of the examples and their rationales.

What carries the argument

The human-in-the-loop, skill-oriented example generation pipeline that creates progressively deeper reasoning examples and reliable CoT rationales from expert-domain videos.

If this is right

Post-training on VideoKR raises accuracy on knowledge-intensive video reasoning tasks relative to earlier datasets.
The same models stay competitive on standard general video reasoning benchmarks.
Ablation studies separate the contribution of the new data from other training factors.
Data design choices, including skill progression and rationale quality, drive measurable gains in video reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline could be adapted to generate training data for other complex multimodal tasks where shortcuts are common.
Scaling the number of videos while preserving the human-in-the-loop checks might further widen the gap on knowledge-heavy benchmarks.
Benchmarks like VideoKR-Eval could become standard tests for whether video models truly integrate visual and knowledge sources.

Load-bearing premise

The pipeline ensures that the chain-of-thought rationales demand genuine video understanding and external knowledge instead of allowing models to exploit textual patterns or shortcuts.

What would settle it

Models post-trained on VideoKR would show no advantage on VideoKR-Eval if the videos were replaced by static text descriptions of their content.
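
One way to make this check concrete is to measure the post-trained model's advantage separately under each input condition. The sketch below is a minimal illustration, not the authors' evaluation code: the `Record` fields, the model labels `"base"` and `"post_trained"`, and the condition labels `"video"` and `"text_description"` are all hypothetical, and "advantage" is read here as post-trained accuracy minus base-model accuracy, which is an assumption about what the comparison would be.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Record:
    """One graded answer on the benchmark. All field names are hypothetical."""
    model: str      # "base" or "post_trained" (assumed labels)
    condition: str  # "video" or "text_description" (assumed labels)
    correct: bool


def accuracy(records: Iterable[Record], model: str, condition: str) -> float:
    """Fraction of correct answers for one model under one input condition."""
    subset: List[Record] = [
        r for r in records if r.model == model and r.condition == condition
    ]
    if not subset:
        raise ValueError(f"no records for {model!r} under {condition!r}")
    return sum(r.correct for r in subset) / len(subset)


def advantage(records: Iterable[Record], condition: str) -> float:
    """Post-trained accuracy minus base accuracy under one input condition."""
    records = list(records)  # allow generators; we iterate twice
    return (
        accuracy(records, "post_trained", condition)
        - accuracy(records, "base", condition)
    )
```

Under this reading, a clearly positive `advantage(records, "video")` together with an `advantage(records, "text_description")` near zero would match the outcome described above, while a text-description advantage of similar size would suggest the gains do not depend on the video itself.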

Figures

Figures reproduced from arXiv: 2606.05259 by Arman Cohan, Lin Fu, Tingyu Song, Yang Wang, Yilun Zhao, Zheyuan Yang.

**Figure 1.** An overview of the VideoKR training corpus. All videos are newly collected and CC licensed, and span a wide range of professional domains. We develop a skill-oriented QA synthesis pipeline in which every example is grounded in one of three core skills essential for advanced video reasoning, and examples in the CoT subset are further paired with a high-quality reasoning trace.

**Figure 2.** (Left) Overview of data construction pipeline. (Right) Statistics of VideoKR-SFT-201K and VideoKR-RL-114K training corpus.

**Figure 3.** Inference-time frame scaling results on general and knowledge-intensive video reasoning benchmarks. The figure shows category-wise average accuracies for Qwen2.5-VL-7B-Instruct and its VideoKR post-trained variant (SFT+RL) under different input frame budgets. Appendix D.1 provides the full per-benchmark results for post-trained Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct models.

**Figure 4.** A VideoKR-SFT-201K example from the natural science domain. The reasoning process is presented in a concise and abbreviated form to improve readability. Q: What device is physically connected to the phone before measurements begin? Answer: External speakers. Reasoning: I need to examine the video carefully to identify what device is physically connected to the phone before the measurement phase begins. (...a…

**Figure 5.** A VideoKR-SFT-201K example from the healthcare domain. The reasoning process is presented in a concise and abbreviated form to improve readability.

**Figure 6.** A VideoKR-SFT-201K example from the engineering domain. The reasoning process is presented in a concise and abbreviated form to improve readability. Q: In the illustration with a warehouse, how many helmeted figures stand in a row? Answer: D. Reasoning: I examine the warehouse illustration with helmeted stick figures in the video (...abbreviated...). This illustration appears under the "Step 1: Define Proble…

**Figure 7.** A VideoKR-SFT-201K example from the engineering domain. The reasoning process is presented in a concise and abbreviated form to improve readability. Q: At around 03:32, which item is shown immediately after the text sequence ends? Answer: C. Reasoning: Let me work through the timeline to understand what happens at the timestamp mentioned in the question. (...abbreviated...) Examining the sequence around that …

**Figure 8.** A VideoKR-SFT-201K example from the humanities and social science domain. The reasoning process is presented in a concise and abbreviated form to improve readability.

**Figure 9.** Comparison of model responses on a knowledge-intensive video reasoning sample.

**Figure 10.** Comparison of model responses on a knowledge-intensive video reasoning sample.

**Figure 11.** Comparison of model responses on a knowledge-intensive video reasoning sample.

Original abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT$\rightarrow$GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoKR adds a sizable new dataset and benchmark for knowledge-intensive video reasoning, but the link between the data design and claimed gains rests on an assumption about blocking text shortcuts that lacks the quantitative checks needed to fully support it.

The main thing to know is that this paper releases VideoKR, a 315K-example training corpus built on 145K newly collected videos, plus the VideoKR-Eval benchmark, both aimed at pushing models toward deeper knowledge and reasoning from video rather than simpler tasks.

What stands out as new is the scale and focus. Prior video datasets have not combined this volume with an explicit target on knowledge-intensive reasoning and CoT rationales generated through a human-in-the-loop, skill-oriented pipeline. The effort to curate CC-licensed expert-domain videos and to design the generation process for progressive difficulty and reliability is a concrete step toward filling that gap. The experiments under a standard SFT to GRPO pipeline, plus the ablations, at least attempt to show that training on this data improves performance on the targeted reasoning tasks while staying competitive elsewhere.

The soft spot is the load-bearing assumption flagged in the stress-test note. The paper states that the pipeline and expert annotations on the eval set ensure questions require genuine video understanding instead of textual shortcuts, yet the abstract supplies no numbers on text-only baselines, inter-annotator checks, or similar verification. Without those, it remains possible that observed gains trace to distribution shifts or stronger text reasoning rather than video-specific knowledge. If the full manuscript includes those checks, the attribution holds up better; on the given details, the evidence for the central claim is only partially grounded.

This work is aimed at researchers building or evaluating multimodal video models, especially those focused on data curation for reasoning. Readers who need new resources or pipelines for knowledge-heavy video tasks would find usable material here. It deserves a serious referee because the new corpus and benchmark address a real gap at meaningful scale, even though the experimental reporting would likely need tightening on verification details.
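
The inter-annotator check mentioned in the letter is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic illustration of that statistic and says nothing about how the authors validated their annotations; the function name and the toy labels in the usage note are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, two annotators who mark four questions as `["A", "A", "B", "B"]` and `["A", "A", "B", "A"]` agree on three of four items, with a chance agreement of 0.5, giving a kappa of 0.5.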

Referee Report

2 major / 1 minor

Summary. The paper introduces VideoKR, a 315K-example training corpus over 145K newly collected videos, generated via a human-in-the-loop skill-oriented pipeline that produces CoT rationales, together with the expert-annotated VideoKR-Eval benchmark. Under a fixed SFT→GRPO post-training regime, models trained on VideoKR are reported to outperform prior post-training methods on knowledge-intensive video reasoning while remaining competitive on general video reasoning; comprehensive ablations are said to isolate the contributions of the data design.

Significance. If the central claims are substantiated, the work supplies a large-scale, expert-curated resource and a new benchmark that could serve as a foundation for future video-reasoning research. The emphasis on data curation as a driver of progress, together with the reported ablations, would provide concrete guidance for dataset construction in the field.

major comments (2)

[VideoKR-Eval and experiments] VideoKR-Eval description and experiments: the attribution of gains to 'genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts' is load-bearing for the headline claim. No text-only baseline accuracies, caption-only results, or quantitative verification that questions cannot be solved from text/priors alone are reported, leaving the assumption that the human-in-the-loop pipeline and expert annotation enforce video necessity unsupported by evidence.
[Experiments] Experiments and ablations: while the abstract states that models 'outperform prior post-training approaches' and that 'comprehensive ablations' were conducted, the provided text supplies neither numerical metrics, specific baseline names and scores, nor details on what was ablated (e.g., data scale, rationale quality, skill categories). This absence prevents assessment of effect sizes and robustness of the central result.

minor comments (1)

[Abstract] Abstract: the phrase 'comprehensive ablations' is used without enumerating the factors varied; a short parenthetical list would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important gaps in evidence and reporting that we will address through revisions. We respond point-by-point below.

Referee: [VideoKR-Eval and experiments] VideoKR-Eval description and experiments: the attribution of gains to 'genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts' is load-bearing for the headline claim. No text-only baseline accuracies, caption-only results, or quantitative verification that questions cannot be solved from text/priors alone are reported, leaving the assumption that the human-in-the-loop pipeline and expert annotation enforce video necessity unsupported by evidence.

Authors: We agree this verification is necessary to support the central claim. In the revised manuscript we will add text-only baseline results (models evaluated on question text alone) and caption-only results on VideoKR-Eval, together with quantitative comparisons showing substantial performance drops when video content is removed. These additions will directly demonstrate that the benchmark questions require video input rather than textual shortcuts or priors.

Revision: yes

Referee: [Experiments] Experiments and ablations: while the abstract states that models 'outperform prior post-training approaches' and that 'comprehensive ablations' were conducted, the provided text supplies neither numerical metrics, specific baseline names and scores, nor details on what was ablated (e.g., data scale, rationale quality, skill categories). This absence prevents assessment of effect sizes and robustness of the central result.

Authors: We acknowledge that the main text should contain explicit numerical results, baseline names, scores, and ablation details for transparency. The revised version will expand the experiments section to include these: specific scores against named prior methods, effect sizes, and breakdowns of ablations on data scale, rationale quality, and skill categories. This will allow direct assessment of the reported gains.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results from new data and standard training

The paper introduces a new corpus (315K examples) and benchmark via a described human-in-the-loop pipeline, then reports empirical performance of models post-trained with standard SFT→GRPO on that data versus priors. No equations, fitted parameters, or self-citations reduce the claimed gains to inputs by construction. The VideoKR-Eval design and ablations are methodological choices whose validity is externally testable via model accuracy; they do not tautologically define the outcome. This matches the default expectation for data-centric papers with independent evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work depends on the assumption that human annotators can reliably produce reasoning chains that genuinely require video input; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: human annotators in the loop produce accurate and video-dependent chain-of-thought rationales.
The pipeline description in the abstract relies on this to ensure example quality and benchmark validity.

pith-pipeline@v0.9.1-grok · 5707 in / 1246 out tokens · 23465 ms · 2026-06-28T06:58:34.323831+00:00 · methodology

Reward formulation (fragment extracted from the paper)

... as our reinforcement learning algorithm. Following the standard RLVR-style reward formulation, the total reward is defined as $R = 0.1 \cdot R_f + 0.9 \cdot R_a$, where $R_f$ and $R_a$ denote the format and accuracy rewards, respectively. Specifically, $R_f$ is set to 1.0 if the model output strictly satisfies the required format: <think>...</think><answer>...</answer>. For the accu...
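
The weighted reward in this fragment can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the extracted text is cut off before it defines the accuracy reward, so the exact-match rule used for `accuracy_reward` is an assumption, as are the function names, the tolerance for surrounding whitespace, and the choice to give zero accuracy reward when the output cannot be parsed.

```python
import re

# Required layout quoted in the text: <think>...</think><answer>...</answer>.
# ASSUMPTION: leading/trailing whitespace and whitespace between the two
# blocks are tolerated; the source only says the format must match "strictly".
FORMAT_RE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(output: str) -> float:
    """R_f: 1.0 if the whole output matches the required layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(output) else 0.0


def accuracy_reward(output: str, gold: str) -> float:
    """R_a under an ASSUMED rule: case-insensitive exact match on the answer."""
    match = FORMAT_RE.fullmatch(output)
    if match is None:
        return 0.0  # assumption: an unparseable output earns no accuracy reward
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def total_reward(output: str, gold: str) -> float:
    """R = 0.1 * R_f + 0.9 * R_a, with the weights given in the text."""
    return 0.1 * format_reward(output) + 0.9 * accuracy_reward(output, gold)
```

With these weights, a well-formatted but wrong answer earns only the format share of the reward, so correctness dominates the training signal while malformed outputs are still penalized.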
