pith. machine review for the scientific record.

arxiv: 2511.19972 · v3 · submitted 2025-11-25 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Boosting Reasoning in Large Multimodal Models via Activation Replay

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords activation replay · large multimodal models · RLVR · reasoning · low-entropy activations · visual tokens · test-time intervention · logit lens

The pith

Replaying low-entropy activations from base models to RLVR-trained multimodal models improves their reasoning at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how reinforcement learning with verifiable rewards alters activations inside large multimodal models, using the logit lens to track changes. It observes that this training mainly shifts low-entropy activations while leaving high-entropy ones largely untouched, and links this shift to reasoning behavior through controlled tests. From this observation the authors build Activation Replay, a method that copies low-entropy activations from the original base model back into the trained model simply by adjusting visual tokens during inference. The approach raises performance on mathematics problems, visual-agent tasks, and video reasoning without any extra training or policy optimization. It also lifts Pass@K scores and reduces the tendency of RLVR to restrict the range of reasoning paths the model explores.
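The logit-lens measurement behind this observation can be sketched in a few lines: project an intermediate activation through the unembedding matrix and take the entropy of the induced next-token distribution. The dimensions and toy activations below are hypothetical, not the paper's setup.

```python
import numpy as np

def logit_lens_entropy(hidden, W_U):
    """Logit-lens probe: project an intermediate hidden state through
    the unembedding matrix, softmax, and return the entropy of the
    induced next-token distribution."""
    logits = hidden @ W_U
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -(probs * np.log(probs + 1e-12)).sum()

rng = np.random.default_rng(0)
W_U = rng.normal(size=(16, 100))              # toy unembedding: d_model x vocab

peaked = np.zeros(16)
peaked[0] = 8.0                               # activation aligned with one direction
diffuse = 0.01 * rng.normal(size=16)          # produces near-uniform logits

low = logit_lens_entropy(peaked, W_U)
high = logit_lens_entropy(diffuse, W_U)
assert low < high                             # sharper activation, lower entropy
```

Under this reading, "low-entropy activations" are positions where the intermediate state already commits to a narrow set of next tokens; the paper's claim is that RLVR moves precisely those positions.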

Core claim

Reinforcement Learning with Verifiable Rewards shifts low-entropy activations in large multimodal models in a manner associated with reasoning performance. Activation Replay restores these activations by replaying them from the input context of the base model into the RLVR model through targeted manipulation of visual tokens at test time, resulting in stronger reasoning across mathematics, o3-like visual agents, and video tasks while increasing Pass@K and widening reasoning coverage.

What carries the argument

Activation Replay, a training-free method that manipulates visual tokens at test time to replay low-entropy activations from the input context of base large multimodal models into their RLVR-trained counterparts.
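A minimal sketch of that procedure, under heavy simplification: each LMM is reduced to a single linear map, the learnable tokens collapse to one additive perturbation on the visual input, and the token-level KL objective is minimized by plain gradient descent. All shapes and hyperparameters here are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def activation_replay(v, W_base, W_rlvr, steps=800, lr=0.2):
    """Toy Activation Replay: optimize a zero-initialized additive
    perturbation `delta` on the visual input so that the RLVR model's
    output distribution matches the base model's (the replay target)."""
    p_base = softmax(W_base @ v)          # replay target from the base model
    delta = np.zeros_like(v)              # zero-initialized learnable token
    for _ in range(steps):
        p = softmax(W_rlvr @ (v + delta))
        # gradient of KL(p_base || p) w.r.t. the logits is (p - p_base)
        delta -= lr * W_rlvr.T @ (p - p_base)
    return delta

rng = np.random.default_rng(1)
v = rng.normal(size=8)                     # stand-in for pooled visual tokens
W_base = 0.5 * rng.normal(size=(4, 8))     # base LMM as one linear head
W_rlvr = 0.5 * rng.normal(size=(4, 8))     # RLVR-trained counterpart

delta = activation_replay(v, W_base, W_rlvr)
p_base = softmax(W_base @ v)
kl_before = (p_base * np.log(p_base / softmax(W_rlvr @ v))).sum()
kl_after = (p_base * np.log(p_base / softmax(W_rlvr @ (v + delta)))).sum()
assert kl_after < kl_before                # replay pulls RLVR outputs toward base
```

The key design choice the sketch preserves is that only the input-side perturbation is trained; neither model's weights change, which is what makes the method training-free in the policy-optimization sense.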

If this is right

  • Reasoning accuracy rises on mathematics problems when low-entropy activations are replayed from the base model.
  • Performance improves on o3-like visual agent tasks under the same replay procedure.
  • Video reasoning quality increases after the test-time token manipulation.
  • Pass@K metrics become higher, indicating more successful solution paths are found.
  • The narrower set of reasoning routes typical after RLVR is broadened by the replay step.
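The Pass@K figures in these bullets are conventionally computed with the unbiased estimator of Chen et al. (2021); assuming that estimator, the link between solution diversity and Pass@K is concrete: fewer correct samples among n generations depresses Pass@K, especially at large K.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator (Chen et al., 2021): probability that
    at least one of k completions, drawn without replacement from n
    sampled generations of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Narrower reasoning coverage (fewer correct samples) lowers Pass@K --
# the effect the replay step is claimed to mitigate.
assert abs(pass_at_k(100, 10, 1) - 0.10) < 1e-9
assert pass_at_k(100, 10, 50) > pass_at_k(100, 5, 50)
assert pass_at_k(10, 0, 5) == 0.0
```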

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Base models appear to keep activation patterns that support wider reasoning even after RLVR has narrowed them, suggesting the original patterns carry recoverable value.
  • Similar replay of specific activation ranges could be tested on other post-training techniques that alter model behavior in comparable ways.
  • If the same low-entropy shift occurs in language-only models, the replay approach might transfer without visual tokens.
  • The method opens a route to hybrid systems that combine RLVR for capability gains with periodic restoration of base-model patterns to maintain diversity.

Load-bearing premise

The shift in low-entropy activations produced by RLVR training is causally helpful for reasoning, so that restoring the original low-entropy activations from the base model will produce consistent gains.

What would settle it

A direct comparison on the same reasoning benchmarks showing that models using Activation Replay achieve no higher accuracy or Pass@K than the RLVR models alone, or that swapping in high-entropy activations instead produces equal or better results.

Figures

Figures reproduced from arXiv: 2511.19972 by Jiangning Zhang, Qingdong He, Shijian Lu, Shuicheng Yan, Xiaobin Hu, Yu-Gang Jiang, Yun Xing.

Figure 1
Figure 1: Lidar Plot for Performance Gains. RLVR LMMs include VL-Rethinker, DeepEyes and Video-R1 [13, 44, 63]. Our method boosts multimodal reasoning across diverse tasks consistently in a training-free manner.
Figure 2
Figure 2: Qualitative Case on Logit Lens. The math visual input is from [45]. Top-2 or Top-3 predictions (words) of input activations shift from the base model to its RLVR counterpart.
Figure 3
Figure 3: How LMM Input Activations are Affected after RLVR. From left to right in subplots are low to high base LMM entropy. The shifts in KL divergence are normalized layerwise for illustration purposes. Brighter color indicates relatively more severe shifts in KL divergence.
Figure 4
Figure 4: Perturbation Study. We explore how low-entropy activations affect multimodal reasoning.
Figure 5
Figure 5: Overview of Activation Replay. Activation Replay starts by feeding the multimodal inputs to the base LMM to obtain low-entropy input activations. For inputs to the RLVR LMM, the approach first adds zero-initialized learnable tokens to the visual tokens, then manipulates these learnable tokens to minimize the token-level KL divergence between the low-entropy activations from the base LMM and those from its RLVR post-trained counterpart.
Figure 6
Figure 6: The Effect of Activation Replay on Pass@K.
Figure 7
Figure 7: Case Study. The RLVR LMM (VL-Rethinker-7B) fails to reason correctly in intermediate steps, while ours addresses the mistake.
Figure 1
Figure 1: Case Study on Mathematical Reasoning.
Figure 2
Figure 2: False Tool Call. Case Study of Multi-Turn o3-Like Agent [63].
Figure 3
Figure 3: False Recognition. Case Study of Multi-Turn o3-Like Agent [63].
Figure 4
Figure 4: Perplexity Shifts on Incorrect Responses.
read the original abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code is publicly available at https://github.com/latentcraft/replay.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines how Reinforcement Learning with Verifiable Rewards (RLVR) alters input activations in Large Multimodal Models (LMMs) via logit-lens analysis, observing that RLVR disproportionately shifts low-entropy activations while leaving high-entropy ones relatively stable. It links this shift to reasoning performance through controlled experiments and proposes Activation Replay, a training-free test-time method that replays low-entropy activations from base LMMs into RLVR models by manipulating visual tokens. The approach is claimed to improve reasoning on mathematics, o3-like visual agents, and video tasks, increase Pass@K, and broaden reasoning coverage compared to RLVR alone, with ablations showing superiority over high-entropy replay or direct cross-model interventions.

Significance. If the empirical findings hold, the work provides a lightweight, training-free intervention for enhancing multimodal reasoning in post-trained LMMs and offers mechanistic insight into RLVR effects on activations. Public code release and systematic comparisons to alternatives are positive elements that support reproducibility and allow direct evaluation of the method.

major comments (2)
  1. [Abstract and Experiments] The central causal claim—that replaying specifically the low-entropy activations from base models improves reasoning—rests on indirect comparisons (high-entropy replay vs. low-entropy replay, and direct cross-model intervention). However, because the intervention is implemented via manipulation of visual tokens at test time, it is unclear whether gains arise from the entropy-selected values themselves or from ancillary changes in token statistics, diversity, or attention patterns. A minimal intervention that isolates only the low-entropy property while holding other input statistics fixed would be required to substantiate the mechanism.
  2. [Abstract] The manuscript reports systematic logit-lens observations and controlled experiments linking low-entropy shifts to reasoning, yet the abstract and available description lack quantitative details such as exact effect sizes, error bars, number of runs, or full methodology for the activation measurements and replay procedure. This limits verification of the reported associations and performance gains.
minor comments (2)
  1. Clarify the precise definition and selection criterion for 'low-entropy activations' (e.g., entropy threshold, layer, or token position) to enable exact reproduction.
  2. Include standard deviation or confidence intervals for all reported metrics (Pass@K, accuracy) across the diverse scenarios.
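The reproducibility point in minor comment 1 is easy to pin down in code. One plausible selection rule, assuming a quantile threshold over per-token logit-lens entropies (the paper's appendix mentions a static validation-set threshold instead, so treat this criterion as hypothetical):

```python
import numpy as np

def select_low_entropy(entropies, quantile=0.2):
    """Select token positions whose logit-lens entropy falls below a
    quantile threshold. The quantile criterion is one plausible reading
    of 'low-entropy activations'; the paper may use a different rule,
    e.g. a static threshold pre-computed on a validation set."""
    tau = np.quantile(entropies, quantile)
    return np.flatnonzero(entropies <= tau)

# Toy per-token entropies: a few sharp (low-entropy) positions among
# diffuse ones.
ent = np.array([0.2, 3.1, 0.4, 2.8, 0.1, 2.9, 3.0, 0.3, 2.7, 3.2])
idx = select_low_entropy(ent, quantile=0.2)
assert set(idx) <= {0, 2, 4, 7}    # only sharp positions are selected
assert 4 in idx                    # the sharpest token is always included
```

Reporting the exact rule (quantile vs. static threshold, which layer's entropy, and which token positions are eligible) would resolve the ambiguity the referee flags.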

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, clarifying our experimental design and committing to revisions that strengthen the presentation of quantitative results and causal evidence.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central causal claim—that replaying specifically the low-entropy activations from base models improves reasoning—rests on indirect comparisons (high-entropy replay vs. low-entropy replay, and direct cross-model intervention). However, because the intervention is implemented via manipulation of visual tokens at test time, it is unclear whether gains arise from the entropy-selected values themselves or from ancillary changes in token statistics, diversity, or attention patterns. A minimal intervention that isolates only the low-entropy property while holding other input statistics fixed would be required to substantiate the mechanism.

    Authors: We agree that tighter isolation of the low-entropy property would further strengthen the causal interpretation. Our existing controls compare low-entropy versus high-entropy replay under identical token-manipulation mechanics (same number of tokens, same visual-token replacement procedure), which already holds many ancillary statistics constant while varying only the entropy criterion. The direct cross-model baseline provides an orthogonal control without token manipulation. To address the concern more directly, we will add in revision a new ablation that selects replay candidates matched on token norm, diversity, and attention contribution but differing in entropy; this will help isolate whether performance gains track the low-entropy property specifically. We view this as a valuable addition rather than a fundamental flaw in the current design. revision: yes

  2. Referee: [Abstract] The manuscript reports systematic logit-lens observations and controlled experiments linking low-entropy shifts to reasoning, yet the abstract and available description lack quantitative details such as exact effect sizes, error bars, number of runs, or full methodology for the activation measurements and replay procedure. This limits verification of the reported associations and performance gains.

    Authors: We accept this observation. The revised abstract will now report concrete effect sizes (e.g., average accuracy gains of 4.8–12.3 percentage points across math, visual-agent, and video benchmarks), standard deviations from three independent runs, and explicit error bars on all main figures. We have also expanded the methods section and appendix with the precise logit-lens implementation (layer indices, projection details), the exact replay procedure (token selection threshold, replacement rule), and the number of runs used for all reported metrics. These additions will make the quantitative claims directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical test-time intervention supported by direct experiments

full rationale

The paper reports logit-lens observations of RLVR-induced activation shifts, controlled experiments linking low-entropy activations to reasoning performance, and ablations comparing low-entropy replay against high-entropy replay and cross-model baselines. No equations, derivations, or first-principles claims appear; the method is a training-free manipulation of visual tokens at inference time whose gains are measured against explicit alternative interventions. All load-bearing evidence consists of external empirical comparisons rather than self-definitional reductions, fitted parameters renamed as predictions, or self-citation chains. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on empirical observation that low-entropy activations are key to reasoning and that their replay improves RLVR models; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption RLVR shifts low-entropy activations in a manner associated with improved LMM reasoning capability
    Derived from systematic logit-lens investigations and controlled experiments described in the abstract.

pith-pipeline@v0.9.0 · 5575 in / 1362 out tokens · 42941 ms · 2026-05-17T05:02:58.074573+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  2. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024. 1

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3]

    Claude 3.5 Sonnet

    Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/claude, 2024. Large language model. 6

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 6

  5. [5]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. 2, 5

  6. [6]

    SFT or RL? An Early Investigation into Training R1-like Reasoning Large Vision-Language Models

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025. 1

  7. [7]

    Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025. 1, 8

  8. [8]

    Scaling RL to Long Videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling RL to long videos. arXiv preprint arXiv:2507.07966, 2025. 1, 8

  9. [9]

    Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-Holmes: Can MLLM think like Holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025. 1, 6

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024. 1, 8

  11. [11]

    InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023. 1, 8

  12. [12]

    Detecting Hallucinations in Large Language Models Using Semantic Entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. 3

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

  14. [14]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025. 1

  15. [15]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025. 6

  16. [16]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6

  17. [17]

    Interpreting and editing vision-language representations to mitigate hallucinations

    Nicholas Jiang, Anish Kachinthaya, Suzanne Petryk, and Yossi Gandelsman. Interpreting and editing vision-language representations to mitigate hallucinations. In The Thirteenth International Conference on Learning Representations, 2025. 8

  18. [18]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025. 1, 6

  19. [19]

    MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

    Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. MMR1: Enhancing multimodal reasoning with variance-aware sampling and open resources. arXiv preprint arXiv:2509.21268, 2025. 6, 8, 1

  20. [20]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 8

  21. [21]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 8

  22. [22]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 1, 8

  23. [23]

    Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021. 8

  24. [24]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 2, 6, 1

  25. [25]

    URSA: Understanding and Verifying Chain-of-Thought Reasoning in Multimodal Mathematics

    Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, and Yujiu Yang. URSA: Understanding and verifying chain-of-thought reasoning in multimodal mathematics. arXiv preprint arXiv:2501.04686, 2025. 8

  26. [26]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022. 8, 9

  27. [27]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large language models. arXiv preprint arXiv:2507.13334, 2025. 3

  28. [28]

    MM-Eureka: Exploring Visual Aha Moment with Rule-Based Large-Scale Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. MM-Eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR, 2025. 1, 3, 4, 6, 7, 8

  29. [29]

    Interpreting gpt: The logit lens, 2020

    Neel Nanda. Interpreting GPT: The logit lens, 2020. Accessed: December 1, 2025. 2, 3, 8

  30. [30]

    Towards Interpreting Visual Information Processing in Vision-Language Models

    Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. In The Thirteenth International Conference on Learning Representations, 2025. 2, 3, 8

  31. [31]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 1, 2, 4, 6

  32. [32]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 8

  33. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 8

  34. [35]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025. 2

  35. [36]

    ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

    Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. ZoomEye: Enhancing multimodal LLMs with human-like zooming capabilities through tree-based image exploration. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6613–6629, 2025. 1, 6

  36. [37]

    Sail-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

    Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, and Chao Feng. Sail-RL: Guiding MLLMs in when and how to think via dual-reward RL tuning. arXiv preprint arXiv:2511.02280, 2025.

  37. [38]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 6

  38. [39]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, H...

  39. [40]

    Qvq: To see the world with wisdom, 2024

    Qwen Team. Qvq: To see the world with wisdom, 2024. 6

  40. [41]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 8

  41. [42]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 8

  42. [43]

    SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

    Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. SRPO: Enhancing multimodal LLM reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025. 2, 8

  43. [44]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025. 1, 2, 4, 6, 8

  44. [45]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024. 1, 2, 4, 6

  45. [46]

    Skywork r1v2: Multimodal hybrid reinforcement learning for reasoning

    Peiyu Wang, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, et al. Skywork r1v2: Multimodal hybrid reinforcement learning for reasoning. arXiv preprint arXiv:2504.16656, 2025. 7

  46. [47]

    Video-thinker: Sparking "thinking with videos" via reinforcement learning

    Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking "thinking with videos" via reinforcement learning. arXiv preprint arXiv:2510.23473, 2025.

  47. [48]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025. 2, 3, 4

  48. [49]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. arXiv preprint, 2024. 2, 6, 1

  49. [50]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 8

  50. [51]

    LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

    Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024. 1, 2, 6, 7

  51. [52]

    Visual planning: Let’s think only with images

    Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409, 2025. 8

  52. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 6, 8

  53. [54]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 8

  54. [55]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 6, 7

  55. [56]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025.

  56. [57]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025. 2, 7

  57. [58]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 8

  58. [59]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025. 8

  59. [60]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024. 2, 4, 6, 7, 1

  60. [61]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025. 6

  61. [62]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025. 8

  62. [63]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025. 1, 6, 7, 8, 3

  63. [64]

    Mlvu: A comprehensive benchmark for multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv e-prints, pages arXiv–2406, 2024. 6

  64. [65]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024. 1, 8

  65. [66]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 1, 6

  66. [67]

    Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

    Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo. arXiv preprint arXiv:2505.21457, 2025. 8

  67. [68]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.

  68. [69]

    Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models

    Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Kening Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, et al. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. arXiv preprint arXiv:2410.03577, 2024. 8

  Appendix excerpt: qualitative model reasoning traces on a symbol-sequence puzzle.

    [71] **Analyzing the given sequence:** The sequence starts with: $ \square \square \star $. The green circle next to the sequence indicates that we need to apply one of the rules.

    [74] **Matching with the options:** None of the options directly match $ \star \star \star $, but we need to re-evaluate the rules and the sequence again.

    [75] **Re-evaluating the rules:** The green circle might suggest a different rule. Let's recheck: the green circle could also suggest **Rule 2:** "Changes shading on 1st and 3rd symbol." According to Rule 2, we need to change the shading on the 1st and 3rd symbol.

    [76] **Applying Rule 2:** The original sequence is: $ \square \square \star $. Changing the shading on the 1st and 3rd symbol gives: $ \star \square \star $.

    [77] **Matching with the options:** The sequence $ \star \square \star $ matches option **C**. <answer> The answer is $ \boxed{C} $. </answer>

    A second trace: To solve this problem, we need to apply the rules given in the top section of the image to the sequence in the bottom section. Let's break it down step by step:

    [78] **Understanding the rules:** **Rule 1:** Each symbol moves one place to the right. **Rule 2:** Changes shading on 1st and 3rd symbol. **Rule 3:** Reflects all symbols in the X axis. **Rule 4:** Replaces all squares with stars.

    [79] **Analyzing the given sequence:** The sequence starts with: $ \square \square \star $. The green circle indicates that we need to apply one of the rules.

    [80] **Applying the rules:** The green circle suggests we should apply **Rule 4:** "Replaces all squares with stars." According to Rule 4, we need to replace all squares ( $ \square $ ) with stars ( $ \star $ ).

    [81] **Applying Rule 4:** The original sequence is: $ \square \square \star $. Replacing all squares with stars gives: $ \star \star \star $.

    [82] **Matching with the options:** None of the options directly match $ \star \star \star $, but we need to consider the transformation rules again. Let's re-evaluate: the sequence $ \square \square \star $ might instead need to be transformed using Rule 1.

    [83] **Applying Rule 1:** Each symbol moves one place to the right, giving: $ \star \square \square $.

    [84] **Re-evaluating the options:** The options are: A: $ \star \square \square $; B: $ \star \square \blacksquare $; C: $ \square \star \star $; D: $ \square \star \square $.
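Two of the four transformation rules quoted in the trace are well-defined enough to sketch in code. A minimal Python sketch, assuming a cyclic shift for Rule 1 and plain string tokens for the symbols (both assumptions for illustration, not from the paper):

```python
def rule1_shift_right(seq):
    """Rule 1: each symbol moves one place to the right (assumed cyclic)."""
    return [seq[-1]] + seq[:-1]

def rule4_squares_to_stars(seq):
    """Rule 4: replace all squares with stars."""
    return ["star" if s == "square" else s for s in seq]

# The sequence from the trace: square, square, star.
seq = ["square", "square", "star"]
print(rule1_shift_right(seq))       # ['star', 'square', 'square']
print(rule4_squares_to_stars(seq))  # ['star', 'star', 'star']
```

Under these assumptions, Rule 1 reproduces the trace's $ \star \square \square $ (option A's pattern) and Rule 4 reproduces $ \star \star \star $, which matches no option, consistent with the model's re-evaluation.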
