CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Bin Hu; Chengwen Liu; Hao Peng; Hong Peng; Jisheng Dang; Tat-Seng Chua

arxiv: 2606.19927 · v1 · pith:FTAT7OS3new · submitted 2026-06-18 · 💻 cs.CV

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu , Hao Peng , Jisheng Dang , Hong Peng , Bin Hu , Tat-Seng Chua This is my paper

Pith reviewed 2026-06-26 17:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords competence-aware reward shapingadaptive reasoning lengthvideo MLLMsreinforcement learningtoken efficiencyGRPOreasoning trajectory

0 comments

The pith

CARE uses moving-average competence estimates to shift video reasoning models from long exploratory traces to short efficient ones during RL training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed reasoning-length rules in reinforcement learning for video multimodal models create a mismatch: they block useful exploration early and waste tokens later once the model improves. CARE tracks competence with an exponential moving average of pass rates, then routes rewards across training stages to favor longer reasoning first and concise reasoning later. It adds batch-level normalization of effort and a posterior amplifier that boosts signals for strong answers on hard samples. When these pieces work, accuracy rises, training stabilizes, and token use drops, with reasoning length tracing an inverted-U curve before settling on shorter, more informative outputs.

Core claim

CARE maintains a smoothed competence estimate via an exponential moving average of pass rates and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with task complexity, CARE normalizes reasoning effort with batch-level statistics and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The mechanism integrates into the GRPO pipeline with no inference overhead and produces higher accuracy, more stable RL, greater token efficiency, an inverted-U reasoning-length tra

What carries the argument

Competence estimate from exponential moving average of pass rates that routes reward-shaping stages, normalizes effort at batch level, and amplifies posterior rewards on difficult samples.

If this is right

Reasoning accuracy rises on video reasoning and general video understanding benchmarks.
The reinforcement learning process becomes more stable during training.
Token consumption drops substantially while output quality holds or improves.
Reasoning length traces a characteristic inverted-U curve over the course of training.
Final converged traces are shorter yet carry more task-relevant information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same competence-routing idea could be tested on text-only or image-only multimodal reasoning tasks to check whether the inverted-U pattern generalizes.
If the batch-normalization step proves critical, replacing it with per-sample statistics might change the stability-accuracy trade-off.
Deployed systems using CARE-trained models would likely need fewer compute resources per video query once the shorter traces are reached.
A controlled study that varies only the posterior amplifier could isolate whether the boost on hard samples is what drives the final accuracy lift.

Load-bearing premise

An exponential moving average of pass rates gives a reliable, unconfounded measure of the model's intrinsic competence that can safely steer reward preferences without mixing verbosity with task difficulty.

What would settle it

Running the same GRPO pipeline with the competence router and posterior amplifier removed, then observing no gain in accuracy or token efficiency and no inverted-U pattern in reasoning length across repeated seeds, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19927 by Bin Hu, Chengwen Liu, Hao Peng, Hong Peng, Jisheng Dang, Tat-Seng Chua.

**Figure 1.** Figure 1: Motivation of CARE. As model competence evolves during reinforcement learning, a fixed preference over [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of CARE. (A) The video-conditioned policy rollouts produce reasoning trajectories. (B) The [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Pipeline of the proposed CARE algorithm. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of the fixed-step ablation in Ta [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Optimization landscape of static length nor [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Training behavior during reinforcement learning. Left: accuracy reward. Middle: reasoning length. Right: [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Length allocation and efficiency analysis across difficulty groups and benchmarks. Left: average length of [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Hyperparameter sensitivity of CARE with respect to stage thresholds, length tolerance, and stabilizer floor. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Kernel density estimation of instance compe [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Evolution of reasoning-length distributions [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative case study under viewpoint occlusion. CARE retains the partial cue from the early view, [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

read the original abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CARE adds EMA-based staging and batch normalization to shift reward preferences in GRPO for video MLLMs, but the competence signal depends on the shaped rewards so the adaptation may not be cleanly tracking intrinsic ability.

read the letter

The main point is that this paper gives a concrete mechanism for making reasoning length adaptive during RL training of video multimodal models. CARE tracks an exponential moving average of pass rates to decide when to move from rewards that encourage longer traces to ones that favor concise ones, and it folds in batch-level normalization plus a posterior amplifier for hard samples. That combination is integrated into GRPO without changing inference cost, and the authors release the code.

The new piece is the specific routing of training stages by competence estimate plus the two supporting tricks for verbosity and sample difficulty. If the experiments hold up, the inverted-U length curve and the final shorter-but-better traces would be useful observations for anyone tuning these models.

The soft spot is the dependence between the routing variable and the reward it controls. Pass rates are produced under the shaped reward, so the EMA competence estimate is not independent; any observed efficiency gain could partly be an artifact of that loop rather than evidence of adaptive allocation. The abstract supplies no numbers, ablations, or derivation details, which leaves the strength of the accuracy and stability claims unclear. The stress-test note on circularity matches the description given.

This is aimed at groups already running GRPO-style training on video reasoning tasks who need length control. It is practical enough and grounded enough in an existing pipeline that it deserves a serious referee to examine the full experiments and check whether the circularity is mitigated in practice.

Referee Report

3 major / 2 minor

Summary. The paper introduces CARE, a competence-aware reward shaping method integrated into GRPO for video-MLLMs. It computes an exponential moving average of pass rates as a competence signal to progressively route the reward from favoring long-form exploration to concise efficiency, augmented by batch-level normalization of reasoning effort and a posterior amplifier for strong performance on hard samples. The central claims are that this yields higher reasoning accuracy, more stable RL training, improved token efficiency, an inverted-U reasoning-length trajectory during training, and shorter yet more informative traces at convergence, with no inference overhead; source code is released.

Significance. If the empirical gains and trajectory hold under the proposed mechanism, CARE would provide a practical, training-only solution to the exploration-efficiency tradeoff in RL-based multimodal reasoning, a recurring issue in current video-MLLM pipelines. The release of reproducible code is a clear strength that enables direct verification and extension.

major comments (3)

[§3.2, Eq. (3)–(5)] §3.2 (CARE mechanism) and Eq. (3)–(5): the competence estimate C_t is defined as an EMA of pass rates, yet the shaped reward R_t directly modulates the policy that produces those pass rates. The manuscript does not demonstrate that the lagged EMA breaks the feedback loop sufficiently to treat C_t as an independent measure of intrinsic competence rather than a function of the intervention itself; without an explicit independence argument or ablation that freezes the routing variable, the inverted-U trajectory and final efficiency gains could be artifacts of the closed loop.
[§4.2, Table 2] §4.2 and Table 2: the reported accuracy and token-efficiency improvements are presented without error bars across random seeds or statistical significance tests against the GRPO baseline; given that the central claim is consistent improvement and stabilization, the absence of these quantities makes it impossible to judge whether the gains exceed run-to-run variance.
[§3.3] §3.3 (posterior amplifier) and the batch-normalization step: these components are introduced to mitigate verbosity-task-complexity conflation, but no ablation isolates their contribution versus the EMA routing alone. If the amplifier or normalization is removed, does the inverted-U trajectory and accuracy gain persist? The current experimental design leaves this load-bearing design choice untested.

minor comments (2)

[Abstract] The abstract states quantitative improvements but supplies no numerical values; the results section should include at least the headline deltas (accuracy, tokens) in the abstract or first paragraph of the introduction for immediate readability.
[§3.2] Notation for the EMA smoothing factor α is introduced without a sensitivity study; a brief paragraph or appendix table showing performance for α ∈ {0.1, 0.5, 0.9} would clarify robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with plans for revisions where the concerns identify gaps in the current manuscript.

read point-by-point responses

Referee: [§3.2, Eq. (3)–(5)] §3.2 (CARE mechanism) and Eq. (3)–(5): the competence estimate C_t is defined as an EMA of pass rates, yet the shaped reward R_t directly modulates the policy that produces those pass rates. The manuscript does not demonstrate that the lagged EMA breaks the feedback loop sufficiently to treat C_t as an independent measure of intrinsic competence rather than a function of the intervention itself; without an explicit independence argument or ablation that freezes the routing variable, the inverted-U trajectory and final efficiency gains could be artifacts of the closed loop.

Authors: We acknowledge the validity of this concern about potential circularity. The EMA lag is intended to provide temporal decoupling, as C_t at step t is computed from pass rates up to t-1 and thus does not directly incorporate the immediate effect of the current shaped reward on the policy. However, the manuscript indeed lacks an explicit independence argument or a controlled ablation. In the revision we will add a short theoretical note in §3.2 explaining the lag-induced separation and include a new ablation that freezes the routing variable C to its initial value throughout training, allowing direct comparison of trajectories with and without the adaptive component. revision: yes
Referee: [§4.2, Table 2] §4.2 and Table 2: the reported accuracy and token-efficiency improvements are presented without error bars across random seeds or statistical significance tests against the GRPO baseline; given that the central claim is consistent improvement and stabilization, the absence of these quantities makes it impossible to judge whether the gains exceed run-to-run variance.

Authors: This is a fair criticism. The current results are reported from single runs, which does not permit assessment of variance. We will rerun the key experiments on the main benchmarks across at least three random seeds, report mean and standard deviation in the revised Table 2, and add pairwise statistical significance tests (e.g., paired t-tests) against the GRPO baseline to substantiate the claimed improvements. revision: yes
Referee: [§3.3] §3.3 (posterior amplifier) and the batch-normalization step: these components are introduced to mitigate verbosity-task-complexity conflation, but no ablation isolates their contribution versus the EMA routing alone. If the amplifier or normalization is removed, does the inverted-U trajectory and accuracy gain persist? The current experimental design leaves this load-bearing design choice untested.

Authors: We agree that isolating the contribution of batch normalization and the posterior amplifier is necessary. The manuscript currently presents only the full CARE configuration. In the revision we will add a dedicated ablation subsection that removes each component individually (and both together) while retaining the EMA routing, reporting accuracy, token counts, and reasoning-length trajectories for each variant. This will directly test whether the inverted-U pattern and final gains require the additional mechanisms. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

full rationale

The paper describes CARE as an engineering mechanism that computes an EMA of pass rates to route between exploration and efficiency rewards within the GRPO pipeline. No equations, derivations, or first-principles claims are present that reduce any prediction or result to its own inputs by construction. The competence estimate is explicitly an input to reward shaping rather than a derived output; empirical improvements and the inverted-U trajectory are reported as observed outcomes on external benchmarks. No self-citations, uniqueness theorems, or ansatzes are invoked. The feedback loop inherent to any RL reward design does not meet the enumerated circularity criteria of self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the introduced mechanisms of EMA competence estimation, batch-level normalization, and posterior amplifier. No free parameters are numerically specified. The approach assumes pass rates reflect competence independently of verbosity.

free parameters (1)

EMA smoothing factor
Used to compute smoothed competence estimate from pass rates; specific value not stated in abstract.

axioms (2)

domain assumption Pass rates provide a valid proxy for model competence that can be used to stage reward preferences
Abstract invokes this to route training from exploration to efficiency without further justification.
domain assumption Batch-level statistics can separate verbosity from intrinsic task complexity
Abstract states this normalization avoids conflating the two but provides no supporting derivation.

pith-pipeline@v0.9.1-grok · 5787 in / 1502 out tokens · 28057 ms · 2026-06-26T17:54:39.391914+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 11 linked inside Pith

[1]

Multimodal chain-of-thought reasoning in language models,

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”arXiv preprint arXiv:2302.00923, 2023

Pith/arXiv arXiv 2023
[2]

Multimodal chain-of-thought rea- soning,

Y. Wanget al., “Multimodal chain-of-thought rea- soning,”arXiv preprint arXiv:2503.12605, 2025

Pith/arXiv arXiv 2025
[3]

Video-r1: Rein- forcing video reasoning in mllms,

K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, B. Wang, and X. Yue, “Video-r1: Rein- forcing video reasoning in mllms,”arXiv preprint arXiv:2503.21776, 2025

Pith/arXiv arXiv 2025
[4]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms,

J. Su, J. Healey, P. Nakov, and C. Cardie, “Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms,”arXiv preprint arXiv:2505.00127, 2025

arXiv 2025
[5]

Stop overthinking: A survey on efficient reasoning for large language models,

Y. Sui, Y.-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,”arXiv preprint arXiv:2503.16419, 2025

Pith/arXiv arXiv 2025
[6]

Thinking fast and right: Bal- ancing accuracy and reasoning length with adaptive rewards,

J. Su and C. Cardie, “Thinking fast and right: Bal- ancing accuracy and reasoning length with adaptive rewards,”arXiv preprint arXiv:2505.18298, 2025

arXiv 2025
[7]

Learn to reason efficiently with adap- tive length-based reward shaping,

W. Liuet al., “Learn to reason efficiently with adap- tive length-based reward shaping,”arXiv preprint arXiv:2505.15612, 2025

arXiv 2025
[8]

Modality 13 gap-driven subspace alignment training paradigm for multimodal large language models,

X. Yu, Y. Xin, W. Zhang, C. Liu, H. Zhao, X. Hu, X. Yu, Z. Qiao, H. Tang, X. Yanget al., “Modality 13 gap-driven subspace alignment training paradigm for multimodal large language models,”arXiv preprint arXiv:2602.07026, 2026

Pith/arXiv arXiv 2026
[9]

Visual description grounding reduces hallucinations and boosts reason- ing in lvlms,

S. Ghosh, C. K. R. Evuru, S. Kumar, U. Tyagi, O. Ni- eto, Z. Jin, and D. Manocha, “Visual description grounding reduces hallucinations and boosts reason- ing in lvlms,”arXiv preprint arXiv:2405.15683, 2024

arXiv 2024
[10]

Unicorn: Text-only data synthesis for vision language model training,

X. Yu, P. Ding, W. Zhang, S. Huang, S. Gao, C. Qin, K. Wu, Z. Fan, Z. Qiao, and D. Wang, “Unicorn: Text-only data synthesis for vision language model training,”arXiv preprint arXiv:2503.22655, 2025

Pith/arXiv arXiv 2025
[11]

Thinking, less seeing? assessing am- plified hallucination in multimodal reasoning by rea- soning chain length and visual attention allocation,

Anonymous, “Thinking, less seeing? assessing am- plified hallucination in multimodal reasoning by rea- soning chain length and visual attention allocation,” arXiv preprint arXiv:2505.21523, 2025

arXiv 2025
[12]

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning,

Q. Wang, Y. Yu, Y. Yuan, R. Mao, and T. Zhou, “Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning,”arXiv preprint arXiv:2505.12434, 2025

arXiv 2025
[13]

Visual-rft: Visual reinforce- ment fine-tuning,

Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang, “Visual-rft: Visual reinforce- ment fine-tuning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2034–2044

2025
[14]

Reason-rft: Reinforcement fine-tuning for visual reasoning,

H. Tan, Y. Ji, X. Hao, M. Lin, P. Wang, Z. Wang, and S. Zhang, “Reason-rft: Reinforcement fine-tuning for visual reasoning,”arXiv preprint arXiv:2503.20752, 2025

arXiv 2025
[15]

Point-rft: Improving multimodal reason- ing with visually grounded reinforcement finetuning,

M. Ni, Z. Yang, L. Li, C.-C. Lin, K. Lin, W. Zuo, and L. Wang, “Point-rft: Improving multimodal reason- ing with visually grounded reinforcement finetuning,” arXiv preprint arXiv:2505.19702, 2025

arXiv 2025
[16]

V-star: Benchmarking video-llms on video spatio-temporal reasoning,

Z. Chenget al., “V-star: Benchmarking video-llms on video spatio-temporal reasoning,”arXiv preprint arXiv:2503.11495, 2025

arXiv 2025
[17]

Llama-vid: An image is worth 2 tokens in large language models,

Y. Li, C. Wang, and J. Jia, “Llama-vid: An image is worth 2 tokens in large language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 323–340

2024
[18]

Videol- lama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,

Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhaoet al., “Videol- lama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,”arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024
[19]

Long context transfer from language to vision,

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,”arXiv preprint arXiv:2406.16852, 2024

Pith/arXiv arXiv 2024
[20]

Vila: On pre-training for visual language models,

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 689–26 699

2024
[21]

Unhackable temporal rewarding for scalable video mllms,

E. Yu, K. Lin, L. Zhao, Y. Wei, Z. Zhu, H. Wei, J. Sun, Z. Ge, X. Zhang, J. Wanget al., “Unhackable temporal rewarding for scalable video mllms,”arXiv preprint arXiv:2502.12081, 2025

arXiv 2025
[22]

Llava- onevision: Easy visual task transfer,

B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liuet al., “Llava- onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

Pith/arXiv arXiv 2024
[23]

Kangaroo: A powerful video-language model supporting long-context video input,

J. Liu, Y. Wang, H. Ma, X. Wu, X. Ma, X. Wei, J. Jiao, E. Wu, and J. Hu, “Kangaroo: A powerful video-language model supporting long-context video input,”International Journal of Computer Vision, vol. 134, no. 3, p. 114, 2026

2026
[24]

Qwen2.5- vl technical report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2.5- vl technical report,”arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[25]

Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo,

J. Park, J. Na, J. Kim, and H. J. Kim, “Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo,”arXiv preprint arXiv:2506.07464, 2025

arXiv 2025
[26]

Tinyllava- video-r1: Towards smaller lmms for video reasoning,

X. Zhang, S. Wen, W. Wu, and L. Huang, “Tinyllava- video-r1: Towards smaller lmms for video reasoning,” arXiv preprint arXiv:2504.09641, 2025

arXiv 2025
[27]

Reinforcement learning tuning for videollms: Reward design and data efficiency,

H. Li, S. Han, Y. Liao, J. Luo, J. Gao, S. Yan, and S. Liu, “Reinforcement learning tuning for videollms: Reward design and data efficiency,”arXiv preprint arXiv:2506.01908, 2025

arXiv 2025
[28]

Video-com: Interactive video reasoning via chain of manipulations,

H. Rasheed, M. Zumri, M. Maaz, M.-H. Yang, F. S. Khan, and S. Khan, “Video-com: Interactive video reasoning via chain of manipulations,”arXiv preprint arXiv:2511.23477, 2025

arXiv 2025
[29]

Thinking in space: How multimodal large language models see, remember, and recall spaces,

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 632–10 643

2025
[30]

Video-mmmu: Evaluating knowl- edge acquisition from multi-discipline professional videos,

K. Hu, P. Wu, F. Pu, W. Xiao, Y. Zhang, X. Yue, B. Li, and Z. Liu, “Video-mmmu: Evaluating knowl- edge acquisition from multi-discipline professional videos,”arXiv preprint arXiv:2501.13826, 2025

Pith/arXiv arXiv 2025
[31]

Mmvu: Measur- ing expert-level multi-discipline video understanding,

Y. Zhao, H. Zhang, L. Xie, T. Hu, G. Gan, Y. Long, Z. Hu, W. Chen, C. Li, Z. Xuet al., “Mmvu: Measur- ing expert-level multi-discipline video understanding,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8475–8489. 14

2025
[32]

Mvbench: A comprehensive multi-modal video understanding benchmark,

K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luoet al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 22 195–22 206

2024
[33]

Tempcompass: Do video llms really understand videos?

Y. Liu, S. Li, Y. Liu, Y. Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?” inFindings of the As- sociation for Computational Linguistics: ACL 2024, 2024, pp. 8731–8772

2024
[34]

Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis,

C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhanget al., “Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis,” inPro- ceedings of the Computer Vision and Pattern Recog- nition Conference, 2025, pp. 24 108–24 118. 15

2025

[1] [1]

Multimodal chain-of-thought reasoning in language models,

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,”arXiv preprint arXiv:2302.00923, 2023

Pith/arXiv arXiv 2023

[2] [2]

Multimodal chain-of-thought rea- soning,

Y. Wanget al., “Multimodal chain-of-thought rea- soning,”arXiv preprint arXiv:2503.12605, 2025

Pith/arXiv arXiv 2025

[3] [3]

Video-r1: Rein- forcing video reasoning in mllms,

K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, B. Wang, and X. Yue, “Video-r1: Rein- forcing video reasoning in mllms,”arXiv preprint arXiv:2503.21776, 2025

Pith/arXiv arXiv 2025

[4] [4]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms,

J. Su, J. Healey, P. Nakov, and C. Cardie, “Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms,”arXiv preprint arXiv:2505.00127, 2025

arXiv 2025

[5] [5]

Stop overthinking: A survey on efficient reasoning for large language models,

Y. Sui, Y.-N. Chuang, G. Wang, J. Zhang, T. Zhang, J. Yuan, H. Liu, A. Wen, S. Zhong, N. Zou, H. Chen, and X. Hu, “Stop overthinking: A survey on efficient reasoning for large language models,”arXiv preprint arXiv:2503.16419, 2025

Pith/arXiv arXiv 2025

[6] [6]

Thinking fast and right: Bal- ancing accuracy and reasoning length with adaptive rewards,

J. Su and C. Cardie, “Thinking fast and right: Bal- ancing accuracy and reasoning length with adaptive rewards,”arXiv preprint arXiv:2505.18298, 2025

arXiv 2025

[7] [7]

Learn to reason efficiently with adap- tive length-based reward shaping,

W. Liuet al., “Learn to reason efficiently with adap- tive length-based reward shaping,”arXiv preprint arXiv:2505.15612, 2025

arXiv 2025

[8] [8]

Modality 13 gap-driven subspace alignment training paradigm for multimodal large language models,

X. Yu, Y. Xin, W. Zhang, C. Liu, H. Zhao, X. Hu, X. Yu, Z. Qiao, H. Tang, X. Yanget al., “Modality 13 gap-driven subspace alignment training paradigm for multimodal large language models,”arXiv preprint arXiv:2602.07026, 2026

Pith/arXiv arXiv 2026

[9] [9]

Visual description grounding reduces hallucinations and boosts reason- ing in lvlms,

S. Ghosh, C. K. R. Evuru, S. Kumar, U. Tyagi, O. Ni- eto, Z. Jin, and D. Manocha, “Visual description grounding reduces hallucinations and boosts reason- ing in lvlms,”arXiv preprint arXiv:2405.15683, 2024

arXiv 2024

[10] [10]

Unicorn: Text-only data synthesis for vision language model training,

X. Yu, P. Ding, W. Zhang, S. Huang, S. Gao, C. Qin, K. Wu, Z. Fan, Z. Qiao, and D. Wang, “Unicorn: Text-only data synthesis for vision language model training,”arXiv preprint arXiv:2503.22655, 2025

Pith/arXiv arXiv 2025

[11] [11]

Thinking, less seeing? assessing am- plified hallucination in multimodal reasoning by rea- soning chain length and visual attention allocation,

Anonymous, “Thinking, less seeing? assessing am- plified hallucination in multimodal reasoning by rea- soning chain length and visual attention allocation,” arXiv preprint arXiv:2505.21523, 2025

arXiv 2025

[12] [12]

Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning,

Q. Wang, Y. Yu, Y. Yuan, R. Mao, and T. Zhou, “Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning,”arXiv preprint arXiv:2505.12434, 2025

arXiv 2025

[13] [13]

Visual-rft: Visual reinforce- ment fine-tuning,

Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang, “Visual-rft: Visual reinforce- ment fine-tuning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2034–2044

2025

[14] [14]

Reason-rft: Reinforcement fine-tuning for visual reasoning,

H. Tan, Y. Ji, X. Hao, M. Lin, P. Wang, Z. Wang, and S. Zhang, “Reason-rft: Reinforcement fine-tuning for visual reasoning,”arXiv preprint arXiv:2503.20752, 2025

arXiv 2025

[15] [15]

Point-rft: Improving multimodal reason- ing with visually grounded reinforcement finetuning,

M. Ni, Z. Yang, L. Li, C.-C. Lin, K. Lin, W. Zuo, and L. Wang, “Point-rft: Improving multimodal reason- ing with visually grounded reinforcement finetuning,” arXiv preprint arXiv:2505.19702, 2025

arXiv 2025

[16] [16]

V-star: Benchmarking video-llms on video spatio-temporal reasoning,

Z. Chenget al., “V-star: Benchmarking video-llms on video spatio-temporal reasoning,”arXiv preprint arXiv:2503.11495, 2025

arXiv 2025

[17] [17]

Llama-vid: An image is worth 2 tokens in large language models,

Y. Li, C. Wang, and J. Jia, “Llama-vid: An image is worth 2 tokens in large language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 323–340

2024

[18] [18]

Videol- lama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,

Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhaoet al., “Videol- lama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,”arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024

[19] [19]

Long context transfer from language to vision,

P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,”arXiv preprint arXiv:2406.16852, 2024

Pith/arXiv arXiv 2024

[20] [20]

Vila: On pre-training for visual language models,

J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 689–26 699

2024

[21] [21]

Unhackable temporal rewarding for scalable video mllms,

E. Yu, K. Lin, L. Zhao, Y. Wei, Z. Zhu, H. Wei, J. Sun, Z. Ge, X. Zhang, J. Wanget al., “Unhackable temporal rewarding for scalable video mllms,”arXiv preprint arXiv:2502.12081, 2025

arXiv 2025

[22] [22]

Llava- onevision: Easy visual task transfer,

B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liuet al., “Llava- onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

Pith/arXiv arXiv 2024

[23] [23]

Kangaroo: A powerful video-language model supporting long-context video input,

J. Liu, Y. Wang, H. Ma, X. Wu, X. Ma, X. Wei, J. Jiao, E. Wu, and J. Hu, “Kangaroo: A powerful video-language model supporting long-context video input,”International Journal of Computer Vision, vol. 134, no. 3, p. 114, 2026

2026

[24] [24]

Qwen2.5- vl technical report,

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2.5- vl technical report,”arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025

[25] [25]

Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo,

J. Park, J. Na, J. Kim, and H. J. Kim, “Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo,”arXiv preprint arXiv:2506.07464, 2025

arXiv 2025

[26] [26]

Tinyllava- video-r1: Towards smaller lmms for video reasoning,

X. Zhang, S. Wen, W. Wu, and L. Huang, “Tinyllava- video-r1: Towards smaller lmms for video reasoning,” arXiv preprint arXiv:2504.09641, 2025

arXiv 2025

[27] [27]

Reinforcement learning tuning for videollms: Reward design and data efficiency,

H. Li, S. Han, Y. Liao, J. Luo, J. Gao, S. Yan, and S. Liu, “Reinforcement learning tuning for videollms: Reward design and data efficiency,”arXiv preprint arXiv:2506.01908, 2025

arXiv 2025

[28] [28]

Video-com: Interactive video reasoning via chain of manipulations,

H. Rasheed, M. Zumri, M. Maaz, M.-H. Yang, F. S. Khan, and S. Khan, “Video-com: Interactive video reasoning via chain of manipulations,”arXiv preprint arXiv:2511.23477, 2025

arXiv 2025

[29] [29]

Thinking in space: How multimodal large language models see, remember, and recall spaces,

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 632–10 643

2025

[30] [30]

Video-mmmu: Evaluating knowl- edge acquisition from multi-discipline professional videos,

K. Hu, P. Wu, F. Pu, W. Xiao, Y. Zhang, X. Yue, B. Li, and Z. Liu, “Video-mmmu: Evaluating knowl- edge acquisition from multi-discipline professional videos,”arXiv preprint arXiv:2501.13826, 2025

Pith/arXiv arXiv 2025

[31] [31]

Mmvu: Measur- ing expert-level multi-discipline video understanding,

Y. Zhao, H. Zhang, L. Xie, T. Hu, G. Gan, Y. Long, Z. Hu, W. Chen, C. Li, Z. Xuet al., “Mmvu: Measur- ing expert-level multi-discipline video understanding,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 8475–8489. 14

2025

[32] [32]

Mvbench: A comprehensive multi-modal video understanding benchmark,

K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luoet al., “Mvbench: A comprehensive multi-modal video understanding benchmark,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 22 195–22 206

2024

[33] [33]

Tempcompass: Do video llms really understand videos?

Y. Liu, S. Li, Y. Liu, Y. Wang, S. Ren, L. Li, S. Chen, X. Sun, and L. Hou, “Tempcompass: Do video llms really understand videos?” inFindings of the As- sociation for Computational Linguistics: ACL 2024, 2024, pp. 8731–8772

2024

[34] [34]

Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis,

C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhanget al., “Video- mme: The first-ever comprehensive evaluation bench- mark of multi-modal llms in video analysis,” inPro- ceedings of the Computer Vision and Pattern Recog- nition Conference, 2025, pp. 24 108–24 118. 15

2025